Abstract


This work focuses on two major aspects of a parallelizing compiler for Fortran-D:
data dependence analysis and loop restructuring. The traditional approach of data
flow analysis, employed by compilers for sequential machines, is not sufficient for
exploiting the potential parallelism present in loops. Data dependence analysis
captures the reference pattern of the arrays in the loops. FRAMES,
the earlier version of the Fortran-D compiler developed by our group, relied on
two primitive tests, namely the GCD and Banerjee's tests, for data dependence
analysis. These tests are highly conservative and hence fail to extract
the full amount of parallelism present in scientific programs. More sophisticated
dependence tests are implemented, which efficiently handle coupled subscripts,
trapezoidal regions and symbolic variables. A clean interface to data dependence
analysis is also provided. This interface enables programmers to incorporate new
dependence tests without difficulty.

Loop restructuring transforms the loop-nests so that they map onto
the underlying architecture. The validity of these transformations is ensured by
preserving the data dependence relations provided by the data dependence analysis.
Several loop restructuring techniques are implemented: loop interchange, cycle shrinking,
loop distribution and loop fusion. These restructuring techniques
significantly enhance the parallelizing capability of FRAMES.

This thesis discusses the design and implementation issues involved in the above 
mentioned aspects of FRAMES. 
Chapter 1
Introduction 


1.1 Introduction 

The speed of sequential machines has approached its theoretical limits and parallel
computers seem to be the only way to increase computational power. Parallel
computers, however, are not useful unless they are easily programmable. Two
factors discourage scientists from using parallel machines. Firstly, there is
no programming language available that enables scientists to easily write parallel
programs. Secondly, there is no automatic restructuring compiler to
parallelize the billions of lines of sequential code that already exist. The FRAMES project
at IIT, Kanpur addresses the second issue. FRAMES [15, 14, 5, 9]
is a restructuring compiler which converts code written in Fortran-77 into Fortran-D
[7], an extension of Fortran for MIMD architectures.


1.2 Structure of FRAMES 

There are four main components of FRAMES. 

1. Front end. 

2. Data dependence analyzer. 

3. Restructurer. 
4. Scheduler.

1.2.1 Front end 

The Front end parses the input program, performs interprocedural analysis and
finally applies various optimizations. The parser generates the abstract syntax tree
(AST) corresponding to the input program. The AST is used for interprocedural
analysis and scalar optimizations. The scalar optimizations that the Front end
performs include copy propagation and constant folding [1]. The optimizations
also include loop normalization, which is very useful for the later phases.

1.2.2 Data dependence analyzer 

Traditionally, compilers treat complex variables, such as arrays, in a very conservative
way. They treat a reference to an element of a complex variable as a
reference to the entire data object. Scientific programs spend most of their
execution time in loops which contain array references [12]. Consequently,
traditional data flow analysis leaves most of the parallelism in the loops unexploited.
The data dependence analysis used in a restructuring compiler determines whether
a given pair of array references results in a dependence. There are a number of data
dependence tests available which use the subscript expressions and loop bounds
to solve the dependence problem. General literature on this subject is widely
available [18, 13, 10, 3]. A data dependence graph (DDG) is constructed based upon
the data dependence decision algorithms to make the dependence relations between
statements explicit.

1.2.3 Restructurer 

The various loop restructuring techniques are applied on the DDG [17, 12]. The data
structure needed for program representation is the program dependence graph (PDG) [4],
which supports operations on the program very efficiently. The restructurer transforms
the loop-nests in such a way that they can be executed on different processors
to increase the speed-up. The loop restructuring techniques that are implemented
include node splitting, loop interchange, loop fusion, loop distribution and
cycle shrinking.

1.2.4 Scheduler 

Loop restructuring transforms the loop-nests such that they can be executed on
different processors. The scheduler [12] assigns processors to loop iterations so as
to maximally utilize the available processors. The main goals of scheduling are load
balancing and synchronization of the loop iterations. The two types of scheduling used
are static and dynamic. Static scheduling assigns the iterations to different
processors at the time of compilation. Sometimes static scheduling cannot be
done owing to insufficient information at the time of compilation. For
example, the loop bounds may not be known or the loop body may have conditional
statements. In such cases, dynamic scheduling is employed. It inserts code in
the user program to assign iterations to the processors during execution.


1.3 Objectives of the thesis 

The main objective of the work reported in this thesis is to enhance the data dependence
analysis and restructuring capabilities of FRAMES. Tests such as the GCD and
Banerjee's tests were implemented earlier. These tests are less powerful and cannot handle
the whole spectrum of subscript expressions that are usually encountered in
programs. One of the objectives of the thesis is to add various dependence tests to
cover all types of subscript expressions. In the restructurer, node splitting was the only
restructuring technique that was implemented earlier. Incorporating some more
general restructuring techniques into FRAMES is another objective of the thesis.


1.4 Organization of the thesis 


• Chapter 2 discusses the basic concepts of data dependence analysis. Some of
the algorithms that are used during implementation are also discussed.

• Chapter 3 discusses different aspects of dependence tests. It explains various
tests (Banerjee's test, Lambda test, Power test and Omega test) in detail.

• Chapter 4 gives a brief description of the PDG. Algorithms for various restructuring
techniques (loop interchange, cycle shrinking, loop distribution) are given
in this chapter.

• Chapter 5 is dedicated to the testing of various aspects of the code, possible future
developments, shortcomings and conclusions.



Figure 1.1: The structure of the compiler 
Chapter 2

Data Dependence Analysis 


2.1 Introduction 

Dependences between two statements in a given program can arise in two
different ways. Firstly, the value computed in one statement may depend
on a value computed in another statement, so that the computed value would be incorrect if the
order of execution of the two statements were reversed. Such dependences are called
data dependences. In the following example, the statement S2 depends upon S1
since the value assigned to W in S2 depends on X, which is computed in S1. If the
order of execution is reversed, the value of W may be incorrect.

S1: X = Y*Z

S2: W = (X + 1)*V


The second type of dependences, called control dependences, occur between a predicate
and a statement such that the value of the predicate immediately controls the
execution of the statement. Consider the following example.

S1: if (X) then

S2: Y = Y + 1

Here, S2 depends upon the predicate X in statement S1, since the value of X
determines whether S2 is executed or not. Control dependences are
discussed thoroughly in [1, 4].

Traditional data flow analysis analyzes dependences involving scalar variables
and treats complex variables similarly. That is, it is assumed that any two
references to a complex variable access the whole data object. This conservative
approach is not sufficient for restructuring compilers because it leaves
a great amount of parallelism unexploited. The operations on an array may be
performed in parallel if the various references to the array access different locations.
Essentially, there exist three types of parallelism in programs.

Coarse grain parallelism is at the subroutine level. Usually, the computation is
organized into subroutines or coroutines. Independent subroutines can
be executed in parallel.

Medium grain parallelism exists at the loop level. Several different types of
parallel loops exist, depending on the kind of dependence graph of the loop-nest.

Fine grain parallelism is achieved when several independent basic blocks can be
executed in parallel. Fine grain parallelism also includes parallelism at the
statement and operation level in the given program.

Data dependence analysis is intended to address medium and fine grain parallelism.


2.2 Data Dependence Concepts 

Consider the following general form of loop-nest which represents perfectly nested 
as well as imperfectly nested loops. 
DO i_1 = L_1, U_1

  DO i_s = L_s(i_1, i_2, ..., i_{s-1}), U_s(i_1, i_2, ..., i_{s-1})

    DO i_{s+1} = L_{s+1}(i_1, i_2, ..., i_s), U_{s+1}(i_1, i_2, ..., i_s)

      DO i_p = L_p(i_1, i_2, ..., i_{p-1}), U_p(i_1, i_2, ..., i_{p-1})

S1:     A[f_1(i_1, i_2, ..., i_p), f_2(i_1, i_2, ..., i_p), ..., f_m(i_1, i_2, ..., i_p)] = ...

      ENDDO

    ENDDO

    DO i_{p+1} = L_{p+1}(i_1, i_2, ..., i_s), U_{p+1}(i_1, i_2, ..., i_s)

      DO i_q = L_q(i_1, i_2, ..., i_{q-1}), U_q(i_1, i_2, ..., i_{q-1})

S2:     F(A[g_1(i_1, i_2, ..., i_q), g_2(i_1, i_2, ..., i_q), ..., g_m(i_1, i_2, ..., i_q)])

      ENDDO

    ENDDO

  ENDDO

ENDDO


where

• $f_i$ and $g_i$ $(1 \le i \le m)$ are arbitrary subscript expressions.

• m is the number of dimensions of the array reference.

• F is an expression involving an array reference.

S2 depends on S1 if and only if there exist two integer vectors
$(i_1, \ldots, i_s, i_{s+1}, \ldots, i_p)$ and $(j_1, \ldots, j_s, j_{p+1}, \ldots, j_q)$
such that $L_k \le i_k, j_k \le U_k$ $(1 \le k \le q)$ and the
following system of equations is satisfied.

\[
\begin{aligned}
f_1(i_1, i_2, \ldots, i_p) &= g_1(j_1, j_2, \ldots, j_q) \\
f_2(i_1, i_2, \ldots, i_p) &= g_2(j_1, j_2, \ldots, j_q) \\
&\;\;\vdots \\
f_m(i_1, i_2, \ldots, i_p) &= g_m(j_1, j_2, \ldots, j_q)
\end{aligned}
\qquad (2.1)
\]
Intuitively, a dependence between the statements S1 and S2 means that statement
S1 computes a value in an instance i that is subsequently used by the statement
S2 in the instance j. If the $f_i$'s and $g_i$'s are permitted to be arbitrary functions of the
loop index variables, then solving the data dependence problem becomes extremely
difficult. When the $f_i$'s and $g_i$'s are restricted to linear functions, the problem becomes
more tractable. It should be emphasized, however, that even the simplified problem is
NP-complete. Assuming that the functions $f_i$ and $g_i$
are linear functions of the loop index variables, the dependence problem reduces to
finding simultaneous solutions to equations of the form

\[ a_1 x_1 + a_2 x_2 + \cdots + a_n x_n = c \qquad (2.2) \]

Such equations are known as linear diophantine equations. Some properties of
linear diophantine equations are discussed in Section 2.4.

2.2.1 Types of Dependences 

There are three types of dependences that can exist between two given statements
S1 and S2 in a program. Assume that the control flow within a program can reach
S2 after passing through S1. Let IN(S) be the set of memory locations read in
statement S and OUT(S) be the set of memory locations written in statement S.

Definition 2.1 Flow Dependence: A flow dependence, denoted by $S_1\,\delta\,S_2$, exists
between S2 and S1 iff $OUT(S_1) \cap IN(S_2) \neq \emptyset$. The following example illustrates the
relation $S_1\,\delta\,S_2$.

S1: X = ...

S2: ... = X

Definition 2.2 Anti Dependence: If $IN(S_1) \cap OUT(S_2) \neq \emptyset$, then there exists an anti
dependence, written $S_1\,\bar{\delta}\,S_2$, between the two statements. In the following case, $S_1\,\bar{\delta}\,S_2$
holds.

S1: ... = X

S2: X = ...

Definition 2.3 Output Dependence: S1 and S2 are involved in an output dependence
if $OUT(S_1) \cap OUT(S_2) \neq \emptyset$. The notation $S_1\,\delta^{o}\,S_2$ is used to describe the output
dependence between the statements S1 and S2. In the example given below, S1 and
S2 both assign a value to X and hence are involved in an output dependence.

S1: X = ...

S2: X = ...

Anti and output dependences are false dependences and can be eliminated by simple
techniques such as variable renaming. Flow dependence is also referred to as true
dependence in the literature and it cannot be eliminated.
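
The three definitions can be checked mechanically once the read and write sets of each statement are known. The following Python sketch is illustrative only: the function name `classify` and the encoding of IN and OUT as Python sets are assumptions made for this example and are not part of FRAMES.

```python
def classify(in1, out1, in2, out2):
    """Return the dependence types between S1 and S2, where control
    flow reaches S2 after S1.

    in1/out1 are the sets of locations read/written by S1 and
    in2/out2 the same for S2 (hypothetical encoding for illustration).
    """
    kinds = []
    if out1 & in2:      # S1 writes a location that S2 reads
        kinds.append("flow")
    if in1 & out2:      # S1 reads a location that S2 overwrites
        kinds.append("anti")
    if out1 & out2:     # both statements write the same location
        kinds.append("output")
    return kinds


# S1: X = Y*Z      S2: W = (X + 1)*V
print(classify({"Y", "Z"}, {"X"}, {"X", "V"}, {"W"}))  # ['flow']
```

A pair of statements can be involved in more than one kind of dependence at once, which is why the sketch returns a list rather than a single label.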

2.2.2 Direction and Distance Vectors 

The dependence between two statements can be characterized using the distance
and direction vectors. Using these vectors, we can determine

• the loops that carry the dependence

• the direction of the dependence

• the number of iterations of a loop between a pair of references involved in a
dependence.

Definition 2.4 A direction vector is an s-tuple $\Psi = (\psi_1, \psi_2, \ldots, \psi_s)$, where
$\psi_k \in \{<, =, >, *\}$, and we write $S_v\,\delta_{\Psi}\,S_w$ when

• there exist particular instances of $S_v$ and $S_w$, say $S_v[i_1, i_2, \ldots, i_s]$ and $S_w[j_1, j_2, \ldots, j_s]$,
such that $S_v[i_1, i_2, \ldots, i_s]\,\delta\,S_w[j_1, j_2, \ldots, j_s]$, and

• $\theta(i_k)\,\psi_k\,\theta(j_k)$ for $1 \le k \le s$.

A direction vector has s elements corresponding to the common loops enclosing
both the references. $\psi_i$ = '>' signifies that the dependence is carried in the backward
direction with respect to loop i. Similarly, $\psi_i$ = '<' means that the dependence is
carried in the forward direction for loop i. Finally, an '=' element in the direction
vector denotes that the dependence is loop independent. If all the directions are
valid, or if the direction of the dependence is unknown, then the corresponding
element is usually represented by '*'.

Definition 2.5 A distance vector is represented by $D = (d_1, d_2, \ldots, d_s)$, where each $d_i$ is an
integer, and we say $S_v\,\delta_{(d_1, d_2, \ldots, d_s)}\,S_w$ when

• there exist particular instances of $S_v$ and $S_w$, say $S_v[i_1, i_2, \ldots, i_s]$ and $S_w[j_1, j_2, \ldots, j_s]$,
such that $S_v[i_1, i_2, \ldots, i_s]\,\delta\,S_w[j_1, j_2, \ldots, j_s]$, and

• $j_k = i_k + d_k$ for $1 \le k \le s$.

The direction vector can be easily obtained from the distance vector by considering
the sign of each element of the distance vector. Positive and negative elements of
the distance vector correspond to '<' and '>' in the direction vector, respectively.
Similarly, if an element of the distance vector is 0, it implies '=' for the corresponding
element of the direction vector.
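
The sign rule above translates directly into code. The following is a minimal Python sketch; the function name `direction_from_distance` is an assumption chosen for this illustration.

```python
def direction_from_distance(distance):
    """Map every element of a distance vector to '<', '=' or '>'."""
    def to_direction(d):
        if d > 0:
            return "<"   # sink executes in a later iteration
        if d < 0:
            return ">"   # sink executes in an earlier iteration
        return "="       # same iteration of this loop
    return tuple(to_direction(d) for d in distance)


print(direction_from_distance((1, -2)))  # ('<', '>')
```

The mapping is one way only: a direction vector does not determine the distance vector, since it records the sign of each element but not its magnitude.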

The depth of dependence denotes the loop due to which the data dependence
between two statements arises. The depth of dependence is defined as follows.

Definition 2.6 S2 depends on S1 at depth d (denoted $S_1\,\Delta_d\,S_2$) if there exists a
$k \ge d$ such that $S_1\,\delta_k\,S_2$. In other words,

\[ \Delta_d = \bigcup_{k \ge d} \delta_k \]

In the following example, the direction vector is (<, >) and the distance vector is
(1, -2). The depth of dependence is 1.

DO I = 1, N
  DO J = 1, M
    A(I, J) = ...
    ... = A(I - 1, J + 2)
  ENDDO
ENDDO
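
The distance vector quoted for this example can be confirmed by exhaustive search over a small iteration space. The sketch below is illustrative only; the function name `distances` and the bounds used in the call are assumptions, and brute force is used purely as a check, not as a dependence test a compiler would apply.

```python
from itertools import product


def distances(n, m):
    """Enumerate the distance vectors between the write A(I, J) and the
    read A(I - 1, J + 2) for 1 <= I <= n and 1 <= J <= m."""
    found = set()
    space = list(product(range(1, n + 1), range(1, m + 1)))
    for (i1, j1) in space:          # iteration that writes A(i1, j1)
        for (i2, j2) in space:      # iteration that reads A(i2 - 1, j2 + 2)
            if (i1, j1) == (i2 - 1, j2 + 2):
                found.add((i2 - i1, j2 - j1))
    return found


print(distances(4, 5))  # {(1, -2)}
```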

2.2.3 Loop carried and Loop independent Dependence 

There are two ways in which data dependence can arise between different statements
in a loop-nest. The value stored by one statement may be fetched by another
statement in a later or in the same iteration of the loop. The former case is known
as loop carried dependence while the latter is called loop independent dependence.

Definition 2.7 S2 has a loop carried dependence on S1 if there exist $i_1$ and $i_2$ such
that $1 \le i_1 < i_2 \le N$ and $f_r(i_1) = g_r(i_2)$.

Definition 2.8 S2 has a loop independent dependence on S1 if there exists $i$ $(1 \le i \le N)$
such that $S_2 > S_1$ and $f_r(i) = g_r(i)$.

For the sake of clarity, the above definitions are given with respect to a single
loop. Our interest lies mainly in loop carried dependence, because in MIMD architectures
the entire loop body for a particular loop iteration is executed on a single
processor. Since all the statements within the loop body are executed without any
possible change in the execution order, loop independent dependences are always
preserved.
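
For a single loop with small bounds, the two definitions can be evaluated directly. The following Python sketch does so by enumeration; the function name `classify_single_loop` and the flag `s2_after_s1` (standing for the lexical condition that S2 follows S1) are assumptions introduced for this example.

```python
def classify_single_loop(f, g, n, s2_after_s1=True):
    """Return (loop_carried, loop_independent) for a write with subscript
    f(i) in S1 and a read with subscript g(i) in S2, for 1 <= i <= n."""
    # Definition 2.7: some earlier iteration i1 writes what i2 reads.
    carried = any(f(i1) == g(i2)
                  for i1 in range(1, n + 1)
                  for i2 in range(i1 + 1, n + 1))
    # Definition 2.8: the same iteration writes and then reads.
    independent = s2_after_s1 and any(f(i) == g(i)
                                      for i in range(1, n + 1))
    return carried, independent


# S1: A(I) = ...    S2: ... = A(I - 1)
print(classify_single_loop(lambda i: i, lambda i: i - 1, 10))  # (True, False)
```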


2.3 Data Dependence Graph 

A data dependence graph is a conceptual representation of inter-statement dependences
in a loop-nest. Each statement is represented by a vertex and each
dependence is denoted by a directed edge in the graph.

      DO I = 1, 100
S1:     A(I) = 5.0
        DO J = 1, 100
S2:       B(I, J) = A(I+1)
        ENDDO
        DO J = 1, 100
S3:       C(I, J) = B(I, J)
S4:       A(I+1) = C(I-1, J+1)
        ENDDO
      ENDDO



Figure 2.1: Loop nest and its corresponding DDG 


Definition 2.9 A dependence graph G is an ordered pair (V, E) where V is the set
of vertices representing the statements in the given loop-nest and E is the set of
edges representing inter-statement dependences. Each edge $e \in E$ may be viewed
as a quadruple $\langle S_1, S_2, t, v \rangle$, where there is a dependence of type t (flow, anti,
output) from S1 to S2 and v is either a direction or a distance vector associated with
the dependence.
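
One possible in-memory form of this definition is sketched below in Python. The class names `Edge` and `DDG`, their fields and the method `successors` are assumptions made for illustration; they do not describe the data structures actually used in FRAMES, and the edge added in the usage lines is a made-up example rather than an edge read off Figure 2.1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: str        # source statement, e.g. "S1"
    dst: str        # sink statement
    kind: str       # "flow", "anti" or "output"
    vector: tuple   # direction or distance vector


class DDG:
    def __init__(self, statements):
        self.vertices = set(statements)
        self.edges = set()

    def add_edge(self, src, dst, kind, vector):
        if src not in self.vertices or dst not in self.vertices:
            raise ValueError("unknown statement")
        self.edges.add(Edge(src, dst, kind, tuple(vector)))

    def successors(self, stmt):
        """Statements that depend directly on `stmt`."""
        return sorted({e.dst for e in self.edges if e.src == stmt})


g = DDG(["S1", "S2"])
g.add_edge("S1", "S2", "flow", ("<",))   # hypothetical edge
print(g.successors("S1"))                # ['S2']
```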


Figure 2.1 shows a loop-nest and the corresponding DDG.

Definition 2.10 For two statements S1 and S2, $\eta^{0}(S_1, S_2)$, the nesting level of the
direct dependence of S2 on S1, is the maximum depth at which the dependence exists,
that is

\[
\eta^{0}(S_1, S_2) =
\begin{cases}
\max\{k \ge 1 \mid S_1\,\delta_k\,S_2\}, & \text{if } S_1\,\Delta\,S_2 \\
0, & \text{otherwise}
\end{cases}
\]
2.4 Linear Diophantine Equations

A linear diophantine equation is of the form given in Equation 2.2. Linear
diophantine equations play a significant role in data dependence analysis. Since the
subscript expressions used in practice are linear functions of the loop index variables,
the dependence between a pair of array references can be formulated as a system of
linear diophantine equations. The following theorems describe an elementary result
from number theory which can be used to determine whether there is an integer
solution to a given linear diophantine equation. An extension to a system of linear
diophantine equations is also presented.

Theorem 2.1 Let $a_1, a_2, \ldots, a_n$ and $c$ denote integers such that the $a_i$'s are not all 0,
and let $d = \gcd(a_1, a_2, \ldots, a_n)$. Equation 2.2 has an integer solution if and only
if d divides c.
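
Theorem 2.1 is the basis of the GCD test. As a minimal Python sketch (the function name `has_integer_solution` is an assumption for this example), the check is a single divisibility test. For instance, a write to A(2*I) and a read of A(2*J + 1) give the equation 2i - 2j = 1, which has no integer solution because gcd(2, 2) = 2 does not divide 1.

```python
from functools import reduce
from math import gcd


def has_integer_solution(coeffs, c):
    """Theorem 2.1: a1*x1 + ... + an*xn = c has an integer solution
    iff gcd(a1, ..., an) divides c (coefficients not all zero)."""
    d = reduce(gcd, (abs(a) for a in coeffs), 0)
    if d == 0:
        raise ValueError("coefficients must not all be zero")
    return c % d == 0


print(has_integer_solution([2, -2], 1))  # False: no dependence possible
print(has_integer_solution([3, 5], 1))   # True: a solution exists
```

Note that the theorem ignores the loop bounds, so a True answer only means that dependence has not been ruled out.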

Theorem 2.2 Assume that d divides c. Let $c' = c/d$, $A = (a_1, a_2, \ldots, a_n)^T$,
$D = (d, 0, \ldots, 0)^T$, and U any n x n unimodular integer matrix satisfying UA = D. The
general solution to Equation 2.2 is then given by the formula

\[ (x_1, x_2, \ldots, x_n) = (c', t_2, t_3, \ldots, t_n)\,U \]

where $t_2, t_3, \ldots, t_n$ are arbitrary integers.
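
For n = 2 the unimodular matrix of Theorem 2.2 can be written down from the extended Euclidean algorithm: if s*a1 + t*a2 = d, then U = [[s, t], [-a2/d, a1/d]] has determinant 1 and satisfies UA = D. The Python sketch below works out this special case only; the function names `extended_gcd` and `general_solution` are assumptions for the example.

```python
def extended_gcd(a, b):
    """Return (d, s, t) with s*a + t*b == d == gcd(a, b) and d >= 0."""
    old_r, r = a, b
    old_s, s = 1, 0
    old_t, t = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
        old_t, t = t, old_t - q * t
    if old_r < 0:
        old_r, old_s, old_t = -old_r, -old_s, -old_t
    return old_r, old_s, old_t


def general_solution(a1, a2, c):
    """Return a function t2 -> (x1, x2) that generates the integer
    solutions of a1*x1 + a2*x2 = c, or None if there is no solution."""
    d, s, t = extended_gcd(a1, a2)
    if d == 0 or c % d != 0:
        return None
    c1 = c // d   # c' in Theorem 2.2
    # (x1, x2) = (c', t2) * U with U = [[s, t], [-a2/d, a1/d]]
    return lambda t2: (c1 * s - t2 * (a2 // d), c1 * t + t2 * (a1 // d))


solve = general_solution(6, 4, 10)
print([solve(t2) for t2 in range(3)])  # [(5, -5), (3, -2), (1, 1)]
```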

Theorem 2.3 Consider a system of m linear diophantine equations in n variables:

\[
\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= c_1 \\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= c_2 \\
&\;\;\vdots \\
a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n &= c_m
\end{aligned}
\qquad (2.3)
\]

The above equations can be written in matrix notation as follows.

\[ xA = C \]

where $x = (x_1, x_2, \ldots, x_n)$, $C = (c_1, c_2, \ldots, c_m)$, and

\[
A =
\begin{pmatrix}
a_{11} & a_{21} & \cdots & a_{m1} \\
a_{12} & a_{22} & \cdots & a_{m2} \\
\vdots & \vdots & & \vdots \\
a_{1n} & a_{2n} & \cdots & a_{mn}
\end{pmatrix}
\]

where the $a_{ij}$'s and $c_i$'s are integer constants. Let U denote an n x n unimodular
integer matrix and D an n x m echelon integer matrix such that UA = D. If a
1 x n integer matrix t exists satisfying tD = C, then x = tU is a solution to the
system. Conversely, if x is a solution, there must exist a 1 x n integer matrix t
satisfying

tD = C and x = tU


These three theorems determine the existence of integer solutions to a given set of
linear diophantine equations and are used extensively in data dependence analysis. The
interested reader may refer to [3] for more details and proofs of the above theorems.


2.5 Data Dependence Framework

A generalized framework for computing direction vectors was proposed by Wolfe
[17]. It determines whether and under what conditions the array regions accessed
by the two references intersect. The two regions share common elements when the
subscript functions in Equation 2.3 have a simultaneous solution.

First, a test is made to determine the possible existence of dependence for the
direction vector (*, *, ..., *). If dependence is not ruled out, then one direction
vector element is refined to '<', '=' or '>'. If dependence is ruled out for a
refined direction vector, then the regions accessed by the two references are disjoint
under that direction and it is not refined further; otherwise the next element is
refined in the same way. In this way dependence testing is done on a hierarchy of
direction vectors. The hierarchy for two loops is shown in Figure 2.2.

                          (*,*)

        (<,*)             (=,*)             (>,*)

(<,<) (<,=) (<,>)   (=,<) (=,=) (=,>)   (>,<) (>,=) (>,>)

Figure 2.2: Refinement of direction vector
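
The hierarchical refinement can be sketched as a short recursive procedure. The Python code below is an illustration under stated assumptions: `refine` and `toy_test` are hypothetical names, and `toy_test` stands in for a real dependence test by simply knowing that the only true direction is (<, >).

```python
def refine(test, depth, vector=None, pos=0):
    """Return every fully refined direction vector for which `test`
    cannot rule out dependence.

    `test(vector)` is a caller-supplied predicate that returns False
    only when dependence is disproved for `vector`, whose elements
    are '<', '=', '>' or '*'.
    """
    if vector is None:
        vector = ("*",) * depth
    if not test(vector):
        return []                 # prune the whole subtree
    if pos == depth:
        return [vector]           # no '*' left to refine
    result = []
    for d in "<=>":
        child = vector[:pos] + (d,) + vector[pos + 1:]
        result.extend(refine(test, depth, child, pos + 1))
    return result


def toy_test(vector, actual=("<", ">")):
    """Stand-in for a dependence test: accepts vectors compatible
    with the single known direction `actual`."""
    return all(v == "*" or v == a for v, a in zip(vector, actual))


print(refine(toy_test, 2))  # [('<', '>')]
```

Pruning at an inner node is what makes the hierarchy worthwhile: when (=,*) is rejected, none of its three children is ever tested.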

2.5.1 Legal Direction Vectors 

The validity of direction vectors also depends upon the control dependences of the
references. Assume that S1 and S2 are two statements involved in a dependence.
In the following example, the two statements S1 and S2 are executed under the same
control conditions. Hence all the direction vectors are valid.

DO I = L, U
S1:  ...
S2:  ...
ENDDO

On the other hand, conditional statements in the loop may change the possible
direction vectors. Consider the following example.

DO I = L, U
     IF (...) THEN
S1:    ...
     ELSE
S2:    ...
     ENDIF
ENDDO

The two statements S1 and S2 are executed under mutually exclusive conditions.
Therefore, none of the dependence directions is valid. A loop exit, as illustrated
in the following example, can also affect the validity of some of the dependence
directions.

DO I = L, U
     IF (...) THEN
S2:    ...
       goto label
     ENDIF
ENDDO
label:

Whenever S2 is executed, the unconditional exit from the loop guarantees that S1
would never be executed after S2. Hence, $S_1\,\delta\,S_2$ is allowed but $S_2\,\delta\,S_1$ is no longer
possible. For an inner loop in a loop-nest, the rules are the same as for a single loop when
the direction vector elements for the outer loops are all '='. When the direction vector
element for any outer loop is '<', any direction for an inner loop is allowed.
2.6 Implementation Details

In this section some of the implementation issues are considered. The data dependence
algorithm is given below. The names of the functions and identifiers in the code
indicate their purpose.

Algorithm: Data Dependence Analysis

Input: Loop header of the loop-nest in question.

Output: Data dependence graph.

DDA()
{
    get loop limits in the loop-nest;
    initialize the data dependence graph with vertices;
    for (stat1 = each vertex in the graph)
        for (stat2 = each vertex in the graph starting from stat1)
            if ((any array is defined in both the statements) or
                (any array defined in one statement is used in
                 the other statement)) then {
                get arrays referenced in stat1 and stat2;
                for (each array reference in stat1)
                    for (each array reference in stat2) {
                        ref1 = write reference among the two;
                        ref2 = remaining reference in the pair;
                        standardize subscripts in references ref1 and ref2;
                        call dependence tests;
                        enter the results into log file for analysis purpose;
                        if (all the tests report dependence) {
                            get the intersection of sets of direction vectors
                                reported by the tests;
                            get the best distance vector;
                            find type of dependence;
                            add corresponding edges into the DDG;
                        }
                    }
            }
}
Each linear expression is stored in the form of a vector of fixed length. The
coefficient of the i-th variable in the expression is the i-th element of the vector, and
the constant term occupies an element of its own. The get_Loopinfo() routine analyzes each loop in
the given loop-nest. Each loop is identified by a unique number. The loop limit
expressions are saved in upperb and lowerb. A loop limit can be

• constant

• a linear expression in terms of outer loop index variables

• non-linear, containing unknown symbolic variables

Two vectors record the nature of the limits of each loop. The element corresponding
to a loop in LooplimitsContainSymbolicVariable[] is set if any of its
loop limits contains symbolic variables. Similarly, LooplimitsAreLinear[] specifies
whether the loop limits are linear. This bookkeeping enables the dependence
tests to determine the nature of the loop limits.

The subscript expressions of the given pair of array references are saved in
acoeff and bcoeff. The symbolic variables that may be present in the subscript
expressions can be invariant or variant with respect to the loop-nest. The nature
of the symbolic variables in the subscript expressions is determined by the
routine processSymbolicVariable(). If the variable is constant with respect to
the given loop-nest, the corresponding subscript expression is linear. Otherwise
the subscript expression is non-linear. This information is saved in the vector
SubscriptContainSymbolicVariable[]. A call to call_dep_tests() invokes each
dependence test in the order specified by the user.

Since all the tests are conservative in nature, if any one of the tests determines
independence, no edge is added to the dependence graph. The conservative nature
of the dependence tests may permit dependence directions for which no dependence
exists, but a test never rejects a direction for which a dependence does exist. Hence,
the intersection of the sets of direction vectors returned by the various tests is taken.
Only these direction vectors are added to the dependence graph.
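
The rule for combining the answers of several conservative tests can be sketched as follows. This Python fragment is illustrative only: the function name `combine` and the convention that a test returns `None` when it proves independence are assumptions made for this example, not the interface of call_dep_tests().

```python
def combine(results):
    """Combine the answers of several conservative dependence tests.

    Each entry of `results` is either None (the test proved
    independence) or a set of direction vectors that the test could
    not rule out.  The returned set holds the vectors to record in
    the DDG; an empty set means that no edge is added.
    """
    if any(r is None for r in results):
        return set()              # one proof of independence suffices
    surviving = None
    for r in results:
        surviving = set(r) if surviving is None else surviving & set(r)
    return surviving or set()


weak = {("<",), ("=",), (">",)}   # a weak test rules nothing out
sharp = {("<",)}                  # a sharper test keeps one direction
print(combine([weak, sharp]))     # {('<',)}
```

Taking the intersection is safe precisely because every test is conservative: a direction in which a dependence really exists is reported by all of them, so it always survives.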

Once dependence is established between any two statements, the type of dependence
is to be determined. The following algorithm determines the type of dependence
based upon the lexical positions of the references and the direction vector of the dependence.

Algorithm: Type of Dependence

Input: Direction vector and types of references.

Output: Type of dependence.

typeofdependence()
{
    for (each element in the direction vector)
        if (element = '<') then {
            if (both are write references) then
                return output dependence from stat1 to stat2;
            else
                return flow dependence from stat1 to stat2;
            endif
        }
        elif (element = '>') then {
            inverse the direction vector;
            if (both are write references) then
                return output dependence from stat2 to stat1;
            else
                return anti dependence from stat2 to stat1;
            endif
        }
    if (stat1 and stat2 are same) then
        return no dependence;
    elif (stat2 lexically precedes stat1) then
        if (both are write references) then
            return output dependence from stat2 to stat1;
        else
            return anti dependence from stat2 to stat1;
        endif
    elif (both are write references) then
        return output dependence from stat1 to stat2;
    else
        return flow dependence from stat1 to stat2;
}
Chapter 3

Tests for Dependence Analysis 


3.1 Introduction 

The data dependence tests are decision algorithms which determine the existence of
an integer solution to a given set of linear diophantine equations. As mentioned in
Section 2.2, this is an NP-complete problem. Most of the dependence tests check
for some necessary conditions for solutions to exist. Some other tests employ
integer programming techniques to find a general solution to the given dependence
problem.

The dependence tests are conservative in nature. That is, they assume dependence
unless it is explicitly ruled out by a violation of the necessary conditions. The
various tests differ in the methods they employ and the amount of information they
provide. Some tests give only a "yes" or "no" answer to a given dependence problem.
At the other extreme, some tests can enumerate all the solutions.


3.2 Various properties of dependence tests 

• Some tests, for example the GCD and Banerjee's tests, consider one subscript at
a time. These tests may fail when the given subscripts are coupled*, because
the solution for one subscript expression may not satisfy the other subscript
expressions.

* If the same index variable appears in more than one subscript expression, the latter are known
as coupled subscripts.

• The various tests differ in how accurately they can determine the dependence.
For example, the Power and Omega tests are more accurate than the simple
tests, the GCD and Banerjee's tests [13]. The more accurate tests employ
more expensive techniques to solve the diophantine equations.

• The iteration space formed by the loop bounds can be triangular, rectangular
or trapezoidal in form. Constant loop bounds result in rectangular regions.
On the other hand, if the bounds themselves are functions of the enclosing loop
index variables, the iteration space may be triangular or trapezoidal.
Rectangular iteration spaces represent the simplest cases and some tests, for
example the I test and the Lambda test, are applicable only to such spaces.

• Most of the dependence tests are not applicable when unknown variables occur
either in loop limits or in subscript expressions. The I test can be applied even
when some of the loop bounds are not known. The Omega and Power tests
can be applied to dependence problems with symbolic constants.

• The single index exact test given in [17] handles subscript expressions with
one loop index variable. It is an exact test and can also be used to enumerate
all the solutions within the loop bounds.

• For the purpose of restructuring, it is more useful to know the direction or
distance vector of the dependence, if it exists. Some tests can provide
only a "yes" or "no" answer and hence are less useful.

Unfortunately, there is no general test which handles all the cases efficiently. An
extensive analysis of the performance of various dependence tests is reported in
[16].


3.3 I Test 

The I test is an inexact subscript by subscript test proposed by Kong et al. [8].
The I test combines the GCD and Banerjee's tests in the sense that it determines whether
integer solutions exist within the bounds. Moreover, it may determine independence even if some
of the loop limits are not known.

Definition 3.1 Let $a_1, a_2, \ldots, a_n$, $L$ and $U$ be integers. The equation

\[ a_1 I_1 + a_2 I_2 + \cdots + a_n I_n = [L, U] \qquad (3.1) \]

where $M_k \le I_k \le N_k$ $(1 \le k \le n)$, is referred to as an interval equation and will be
used to denote the set of ordinary equations consisting of

\[
\begin{aligned}
a_1 I_1 + a_2 I_2 + \cdots + a_n I_n &= L \\
a_1 I_1 + a_2 I_2 + \cdots + a_n I_n &= L + 1 \\
&\;\;\vdots \\
a_1 I_1 + a_2 I_2 + \cdots + a_n I_n &= U
\end{aligned}
\]

The I test is based on the following theorem.

Theorem 3.1 Let $a_1, a_2, \ldots, a_n$ be integers. For each k $(1 \le k \le n - 1)$, let each
of $M_k$ and $N_k$ be either an integer or the distinguished symbol '*' (not known), where
$M_k \le N_k$ if both $M_k$ and $N_k$ are integers. Let $M_n$ and $N_n$ be integers, $M_n \le N_n$.
If $|a_n| \le U - L + 1$, then the interval equation (3.1) is $(M_1, N_1; M_2, N_2; \ldots; M_n, N_n)$-integer
solvable iff the interval equation

\[ a_1 I_1 + a_2 I_2 + \cdots + a_{n-1} I_{n-1} = [L - a_n^{+} N_n + a_n^{-} M_n,\; U - a_n^{+} M_n + a_n^{-} N_n] \]

is $(M_1, N_1; M_2, N_2; \ldots; M_{n-1}, N_{n-1})$-integer solvable, where $a^{+} = \max(a, 0)$ and
$a^{-} = \max(-a, 0)$.

The algorithm is discussed in more detail in [8].
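
The reduction step of Theorem 3.1 can be applied repeatedly until no variable is left, at which point the interval equation reads 0 = [L, U] and is solvable exactly when L <= 0 <= U. The Python sketch below implements only this reduction, under simplifying assumptions that should be stated loudly: all bounds are known integers, the GCD-based part of the test and the handling of '*' bounds described in [8] are omitted, and the function name `i_test` and its return convention are hypothetical.

```python
def i_test(coeffs, bounds, low, high):
    """Simplified I test reduction for a1*I1 + ... + an*In = [low, high].

    coeffs -- [a1, ..., an]
    bounds -- [(M1, N1), ..., (Mn, Nn)], all known integers
    Returns True or False when the reduction decides integer
    solvability, and None when no term can be eliminated.
    """
    terms = [(a, m, n) for a, (m, n) in zip(coeffs, bounds) if a != 0]
    while terms:
        width = high - low + 1
        for idx, (a, m, n) in enumerate(terms):
            if abs(a) <= width:          # condition |a| <= U - L + 1
                break
        else:
            return None                  # stuck: the reduction is inconclusive
        a, m, n = terms.pop(idx)
        pos, neg = max(a, 0), max(-a, 0)     # a+ and a-
        low, high = low - pos * n + neg * m, high - pos * m + neg * n
    return low <= 0 <= high


box = [(1, 3), (1, 3)]
print(i_test([1, -1], box, 5, 5))   # False: I1 - I2 = 5 is impossible
print(i_test([1, -1], box, 1, 1))   # True:  I1 - I2 = 1 has a solution
```

Each elimination widens the interval, so terms with larger coefficients may become eligible after terms with smaller ones have been removed.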
3.4 Lambda Test

3.4.1 Description 

The Lambda test is an inexact test which is devised to handle coupled subscripts.
Usually, subscript by subscript tests determine dependence by examining each dimension
of the array references separately. If the examination of any dimension shows no
solution, then there exists no data dependence between the two references. However,
every dimension may have solutions of its own while the sets of solutions for two
dimensions are disjoint, so that there is still no dependence. The subscript by
subscript tests frequently fail to recognize such cases. The Lambda test is more
effective in such cases [6]. This test can be applied only when the array references
have constant loop bounds. However, trapezoidal regions can be converted to
rectangular regions so that the Lambda test can be applied. This may lead to inaccuracy,
but the result is still conservative. One more limitation of this test is that it does
not handle unknown variables in the subscript expressions.

Formally, coupled subscripts are described as follows. 

Definition 3.2 We denote the set of loop indices in $reference_1$ and $reference_2$ by 
$IND = \{i_1, i_2, \ldots, i_n\}$ and denote the set of indices of $reference_1$ and 
$reference_2$ that appear in array dimension $j$, $1 \le j \le m$, by 
$IND_j = \{t \mid t \in IND$ and $t$ appears in either $f_j$ or $g_j\}$. 

1. If $IND_{d_1} \cap IND_{d_2} \ne \emptyset$, then dimensions $d_1$ and $d_2$ are said to be coupled, and 
$reference_1$ and $reference_2$ are said to have coupled subscripts. 

2. If dimensions $d_1$ and $d_2$ are coupled, and $d_2$ and $d_3$ are coupled, then $d_1$ and $d_3$ 
are also coupled. 
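Definition 3.2 amounts to grouping the array dimensions by shared loop indices and closing the relation transitively. A minimal sketch is given below; the function name `coupled_groups` and the input format (one set of index names per dimension) are assumptions made for illustration. 

```python
def coupled_groups(index_sets):
    """Group array dimensions that are (transitively) coupled.

    index_sets[j-1] is IND_j: the loop indices appearing in dimension j
    of either reference.  Returns a sorted list of groups of dimension
    numbers (1-based); a group with more than one dimension is coupled.
    """
    groups = []                      # list of (dimensions, indices) pairs
    for dim, indices in enumerate(index_sets, start=1):
        dims, merged = {dim}, set(indices)
        untouched = []
        for g_dims, g_indices in groups:
            if g_indices & merged:   # shares a loop index: coupled
                dims |= g_dims
                merged |= g_indices
            else:
                untouched.append((g_dims, g_indices))
        groups = untouched + [(dims, merged)]
    return sorted(sorted(d) for d, _ in groups)
```

For a pair of references such as A(i, i+j, k), dimensions 1 and 2 share the index i, so `coupled_groups([{"i"}, {"i", "j"}, {"k"}])` yields the groups [1, 2] and [3]. 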

The Lambda test can be divided into two parts. 

1. finding the linear combination of coupled subscripts 

2. finding maximum and minimum values of the linear combination using loop 
bounds and direction vectors 
Finding linear combination: The loop bounds and the given dependence directions 
correspond to a bounded convex set $V$ in $R^n$. Each linear equation in 
Equation 2.3 is a hyperplane $\pi$ in $R^n$. The intersection $S$ of the $m$ hyperplanes 
corresponds to the common solutions of all the equations. If $S$ is empty then there is 
no data dependence. If a hyperplane $\pi$ intersects $V$ then the corresponding equation 
has real-valued solutions within the loop bounds. Any subscript-by-subscript 
test can test for this condition. But to determine whether there are real-valued 
simultaneous solutions, we need to determine whether $S$ itself intersects $V$. If any of the 
hyperplanes does not intersect $V$, then $S$ cannot intersect $V$. However, even if every 
hyperplane in Equation 2.3 intersects $V$, it is still possible that $S$ and $V$ are disjoint. 

Theorem 3.2 $S \cap V = \emptyset$ iff there exists a hyperplane $\pi$, which corresponds to a 
linear combination of the equations in Equation 2.3, $\sum_{i=1}^{m} \lambda_i f_i = 0$, such 
that $\pi \cap V = \emptyset$. 

The first part of the Lambda test finds the necessary and sufficient $\lambda$-tuples to form 
the linear combination of the $m$ equations. We first consider two coupled subscripts and 
then extend the concept to the generalized version. 

Two coupled subscripts: An arbitrary linear combination of two equations in 
Equation 2.3 can be written as $\lambda_1 f_1 + \lambda_2 f_2 = 0$. The domain of $(\lambda_1, \lambda_2)$ is the whole 
two-dimensional space. Let 

$$f_{\lambda_1,\lambda_2} = \lambda_1 f_1 + \lambda_2 f_2 
= (\lambda_1 a_{11} + \lambda_2 a_{21}) v^{(1)} + (\lambda_1 a_{12} + \lambda_2 a_{22}) v^{(2)} + \cdots + (\lambda_1 a_{1n} + \lambda_2 a_{2n}) v^{(n)}$$

$f_{\lambda_1,\lambda_2}$ can be viewed as a linear function of $(\lambda_1, \lambda_2)$ in two-dimensional space 
with $v^{(1)}, \ldots, v^{(n)}$ fixed. The coefficient of each $v^{(i)}$ in $f_{\lambda_1,\lambda_2}$ is a linear function 
of $(\lambda_1, \lambda_2)$, i.e., $\psi^{(i)} = \lambda_1 a_{1i} + \lambda_2 a_{2i}$. The equation $\psi^{(i)} = 0$ is a straight line, called a 
$\psi$ line, which passes through the origin and divides the space into two halves. There are 
at most $n$ $\psi$ lines, which together divide the space into at most $2n$ regions. Each region 
is a cone called a $\lambda$ cone. 
Lemma 3.1 Suppose $V$ is defined by loop bounds but not by dependence directions. 
If $f_{\lambda_1,\lambda_2} = 0$ intersects $V$ for every $(\lambda_1, \lambda_2)$ on every $\psi$ line, then $f_{\lambda_1,\lambda_2} = 0$ also 
intersects $V$ for every $(\lambda_1, \lambda_2)$ in $R^2$. 

Intuitively, Lemma 3.1 states that if all the $\lambda$-tuples on the boundaries of the $\lambda$ cones 
make $f_{\lambda_1,\lambda_2} = 0$ intersect $V$, then $f_{\lambda_1,\lambda_2} = 0$ intersects $V$ for all $(\lambda_1, \lambda_2)$. Taken 
literally, Lemma 3.1 still leaves an infinite number of $\lambda$-tuples to test, which is not 
feasible for practical purposes. For each $(\lambda_1, \lambda_2)$ we get a different $f_{\lambda_1,\lambda_2}$. The maximum 
and minimum values of $f_{\lambda_1,\lambda_2}$ depend upon the signs of the coefficients of that function. 
Now consider $V$ defined by direction vectors as well as loop bounds. Let $v^{(i)}$ and $v^{(j)}$ be 
the same loop index occurring in the two references. The direction vector gives the 
relation between the two variables $v^{(i)}$ and $v^{(j)}$. The maximum and minimum value of 
each variable depends on the direction vector as well as on the coefficients $a_i$ and $a_j$ of 
$v^{(i)}$ and $v^{(j)}$ respectively [10]. Let $\phi^{(i,j)} = \lambda_1 (a_{1i} + a_{1j}) + \lambda_2 (a_{2i} + a_{2j})$, since $v^{(i)}$ and 
$v^{(j)}$ are related by direction vectors. The equation $\phi^{(i,j)} = 0$ is a line, called a $\phi$ line, in 
two-dimensional space. Now the minimum and maximum values of the function 
$f_{\lambda_1,\lambda_2}$ depend not only on the signs of the coefficients of each $v$ but also on the sign of 
$\phi^{(i,j)}$. There are at most $n/2$ $\phi$ lines. All $\phi$ lines and $\psi$ lines together divide the 
two-dimensional space into at most $3n$ regions. It is proved that if $f_{\lambda_1,\lambda_2} = 0$ intersects $V$ 
for some $(\lambda_1, \lambda_2)$ on a given $\psi$ line or $\phi$ line, then it also intersects $V$ for every 
$(\lambda_1, \lambda_2)$ on that line. Hence it suffices to test a single point on each line. The 
algorithm can be summarized as follows. 

1. Find a point on each $\psi$ line (or $\phi$ line). 

2. Form the linear combination of the equations in Equation 2.3. 

3. If the resultant equation intersects $V$ then go to Step 1; otherwise report no 
dependence. 
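The steps above can be sketched for the simpler case in which $V$ is defined by loop bounds only (no direction vectors, hence no $\phi$ lines). The representation of an equation as a pair (constant, coefficients), the function names, and the extra checks of the two original equations via the tuples (1, 0) and (0, 1) are assumptions of this sketch rather than details taken from [6] or from FRAMES. 

```python
def extreme_values(const, coeffs, bounds):
    """Minimum and maximum of const + sum(c_i * v_i) over the box `bounds`."""
    lo = hi = const
    for c, (lb, ub) in zip(coeffs, bounds):
        lo += c * (lb if c > 0 else ub)
        hi += c * (ub if c > 0 else lb)
    return lo, hi


def lambda_test_2d(eq1, eq2, bounds):
    """Two coupled subscripts, V given by loop bounds only.

    Each equation is (constant, [a_1..a_n]) and stands for
    constant + sum(a_i * v_i) = 0.  Returns False when independence is
    proven and True when a dependence must be assumed.
    """
    (c1, a1), (c2, a2) = eq1, eq2
    candidates = [(1, 0), (0, 1)]            # the two equations themselves
    for x, y in zip(a1, a2):
        if x != 0 or y != 0:
            candidates.append((y, -x))       # a point on psi = l1*x + l2*y = 0
    for l1, l2 in candidates:
        const = l1 * c1 + l2 * c2
        coeffs = [l1 * x + l2 * y for x, y in zip(a1, a2)]
        lo, hi = extreme_values(const, coeffs, bounds)
        if lo > 0 or hi < 0:                 # hyperplane misses V
            return False
    return True
```

As an illustration, the equations $i_1 - i_2 - 1 = 0$ and $i_1 - i_2 - 2 = 0$ each have solutions for $1 \le i_1, i_2 \le 10$, but the combination with $(\lambda_1, \lambda_2) = (1, -1)$ reduces to $1 = 0$, so `lambda_test_2d((-1, [1, -1]), (-2, [1, -1]), [(1, 10), (1, 10)])` returns `False`. 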

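Generalized version: An arbitrary linear combination of $m$ equations can be 
written as 

$$f_{\lambda_1,\lambda_2,\ldots,\lambda_m} = \lambda_1 f_1 + \lambda_2 f_2 + \cdots + \lambda_m f_m 
= \Big(\sum_{j=1}^{m} \lambda_j a_{j1}\Big) v^{(1)} + \Big(\sum_{j=1}^{m} \lambda_j a_{j2}\Big) v^{(2)} + \cdots + \Big(\sum_{j=1}^{m} \lambda_j a_{jn}\Big) v^{(n)} \; (= 0).$$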
Lemma 3.1 can be extended to find a sufficient number of $\lambda$-tuples to determine 
whether $f_{\lambda_1,\lambda_2,\ldots,\lambda_m} = 0$ intersects $V$ for arbitrary $(\lambda_1, \lambda_2, \ldots, \lambda_m)$. It is 
sufficient to take any one point on each $\lambda$ cone boundary. In $m$-dimensional 
space a $\lambda$ cone boundary is formed by any $m - 1$ $\psi$ (or $\phi$) planes. Hence, there 
is a finite set of hyperplanes in $R^n$ such that $S$ intersects $V$ if and only if every 
hyperplane in the set intersects $V$. If $V$ is defined by loop bounds alone, then there 
are no more than $\binom{n}{m-1}$ hyperplanes in the set. On the other hand, if $V$ is defined 
by loop bounds as well as dependence directions, there are no more than $\binom{3n/2}{m-1}$ 
hyperplanes in the set. The algorithm to find the sufficient $\lambda$-tuples is simple. 

1. Generate the $\binom{n}{m-1}$ (or $\binom{3n/2}{m-1}$) combinations of integers from $1 \ldots n$ (or $1 \ldots 3n/2$). 

2. Use the elements in each combination as indices to form $m - 1$ simultaneous 
equations out of the $n$ (or $3n/2$) equations. 

3. Solve the $m - 1$ simultaneous equations for a $\lambda$-tuple. 

4. Form the linear combination of Equation 2.3 using the $\lambda$-tuple. 

5. If the resultant function intersects $V$ go to Step 1. Otherwise there is no dependence. 

Does S intersect V?: The maximum and minimum values of the resultant linear 
combination can be found by using Banerjee’s test. However, Li et al. proposed a set 
of rules to find the minimum and maximum values of the function depending 
on the direction vector. Either method can be used. If zero does not lie between 
the minimum and the maximum (that is, the minimum is greater than zero or the 
maximum is less than zero), then there is no dependence. 

3.4.2 Implementation Details 

The routine convertTrapezoidalRegionToRectangularRegion() is used to convert 
a trapezoidal region to a rectangular region. call_lambda_test() determines the 
number of coupled subscripts and initiates the appropriate routine based on that 
number. To find the maximum and minimum values of the 
linear combination, the routine doesSintersectV() uses the set of rules in [10]. 
3.5 Omega Test 

3.5.1 Description 

The Omega test is an exact test which uses the Fourier-Motzkin elimination method to 
solve a set of inequalities formulated from the dependence problem. The Omega test 
consists of three parts. The first part formulates the equality and inequality constraints. 
The second part applies Fourier-Motzkin elimination to the set of inequalities, 
which produces a system of inequalities in terms of the loop 
index variables. The last part finds the direction and distance vectors. 

Formulation of problem: The input to the Omega test is a set of linear equalities 
($\sum_{0 \le i \le n} a_i x_i = 0$) and inequalities ($\sum_{0 \le i \le n} a_i x_i \ge 0$), where $x_0 = 1$, $a_0$ is the 
constant term and $V$ is the set of loop indices being manipulated. Each constraint 
is normalized. A normalized constraint is one in which all the coefficients are 
integers and the gcd of the coefficients is 1. Given a problem involving equality and 
inequality constraints, we eliminate all equality constraints so that only inequality 
constraints remain. The resultant set of inequalities has integer solutions if and only 
if the original problem has integer solutions. Euclid’s generalized algorithm, discussed in 
Section 3.6, can be used for this purpose, since the equality constraints arise from 
the subscript expressions of the array references. However, the Omega test follows the 
approach below to eliminate equality constraints, for better performance. 
To eliminate the equality $\sum_{i \in V} a_i x_i = 0$: 

1. Check if there exists a $j \ne 0$ such that $|a_j| = 1$. If so, we eliminate the 
constraint by solving for $x_j$ and substituting the result into the other constraints. 

2. Otherwise, let $k$ be the index of the variable whose coefficient has the 
smallest absolute value ($k \ne 0$) and let $m = |a_k| + 1$. The operation $\widehat{\bmod}$ is defined 
as $a \,\widehat{\bmod}\, b = a - b\lfloor a/b + 1/2 \rfloor$. We create a new variable $\sigma$ and produce the 
constraint $m\sigma = \sum_{i \in V} (a_i \,\widehat{\bmod}\, m)\, x_i$. We solve this constraint for $x_k$: 

$$x_k = -\mathrm{sign}(a_k)\, m\, \sigma + \sum_{i \in V - \{k\}} \mathrm{sign}(a_k)\, (a_i \,\widehat{\bmod}\, m)\, x_i$$

In the original constraint, this substitution produces: 

$$-|a_k|\, m\, \sigma + \sum_{i \in V - \{k\}} \big((a_i - (a_i \,\widehat{\bmod}\, m)) + m\,(a_i \,\widehat{\bmod}\, m)\big)\, x_i = 0$$

Normalize this constraint and go to Step 1. 
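Step 2 can be illustrated with a small sketch. The dictionary representation of a constraint (with the key "1" standing for $x_0 = 1$), the function names, and the name `sigma` of the new variable are assumptions made for this example; the case $|a_j| = 1$ of Step 1 is not handled here. 

```python
def mod_hat(a, b):
    """Symmetric remainder  a - b * floor(a/b + 1/2)  for b > 0."""
    return a - b * ((2 * a + b) // (2 * b))


def sign(a):
    return (a > 0) - (a < 0)


def eliminate_equality_step(coeffs, new_var="sigma"):
    """One application of Step 2 to  sum(coeffs[v] * v) = 0.

    coeffs maps variable names to integer coefficients; the key "1"
    holds the constant term.  Returns (k, substitution for x_k,
    new normalized equality).
    """
    variables = [v for v in coeffs if v != "1" and coeffs[v] != 0]
    k = min(variables, key=lambda v: abs(coeffs[v]))
    a_k = coeffs[k]
    m = abs(a_k) + 1
    # x_k = -sign(a_k)*m*sigma + sum_{i != k} sign(a_k)*(a_i mod^ m)*x_i
    subst = {new_var: -sign(a_k) * m}
    for v, a in coeffs.items():
        if v != k:
            subst[v] = sign(a_k) * mod_hat(a, m)
    # Substitute into the original equality; every coefficient is then a
    # multiple of m, so the constraint is normalized by dividing by m.
    new = {}
    for v, c in subst.items():
        total = a_k * c + coeffs.get(v, 0)
        assert total % m == 0
        new[v] = total // m
    return k, subst, new
```

For the equality $7x + 12y + 31z - 17 = 0$, the sketch selects $x$ with $m = 8$, derives $x = -8\sigma - 4y - z - 1$, and leaves the smaller equality $-7\sigma - 2y + 3z - 3 = 0$. 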

Inequality constraints are processed to 

• find contradictory constraints, in which case the system has no solution and 
there is no dependence 

• eliminate redundant constraints 

• tighten the constraints 

Intuitively, we are shrinking the solution space defined by the dependence 
problem in the Cartesian coordinate system. If the problem involves at most one 
variable and has passed the above tests, we report that it has integer solutions. 

Fourier-Motzkin elimination: In the second part of the Omega test we apply 
Fourier-Motzkin elimination to eliminate a variable from the set of inequalities. 
Intuitively, Fourier-Motzkin variable elimination finds the $(n-1)$-dimensional shadow 
cast by an object¹ in $n$-dimensional space. This is called the real shadow. There may be 
integer points in the real shadow of an object even if the object itself contains no 
integer points. To determine the real shadow, consider two constraints on $z$: a lower 
bound $\beta \le bz$ and an upper bound $az \le \alpha$ (where $a$ and $b$ are positive integers and 
$\alpha$ and $\beta$ are linear combinations of the remaining variables in the system of 
inequalities). We combine these constraints to get $a\beta \le abz \le b\alpha$. The constraint 
$a\beta \le b\alpha$ is the shadow of the intersection of these two constraints. By combining the 
shadows of the intersection of each pair of upper and lower bounds on $z$ we obtain the 
real shadow. We also define the dark shadow of the object, which ensures that for every 
integer point in the dark shadow there is an integer point in the object above it. To 
determine the dark shadow, consider when an integer $z$ satisfying $a\beta \le abz \le b\alpha$ 
is guaranteed to exist. This is the case whenever $b\alpha - a\beta \ge (a - 1)(b - 1)$, which 
defines the dark shadow. 
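The relation between the two shadows can be checked on a single pair of bounds. In the sketch below $\alpha$ and $\beta$ are taken to be plain integers (that is, $z$ is the last remaining variable); this simplification and the function names are assumptions made for illustration. 

```python
def real_shadow(beta, b, alpha, a):
    """Real shadow of  beta <= b*z  and  a*z <= alpha  (a, b > 0)."""
    return a * beta <= b * alpha


def dark_shadow(beta, b, alpha, a):
    """Dark shadow: sufficient for an integer z between the two bounds."""
    return b * alpha - a * beta >= (a - 1) * (b - 1)


def has_integer_z(beta, b, alpha, a):
    """Exact check: is there an integer z with beta <= b*z and a*z <= alpha?"""
    lowest = -(-beta // b)      # ceil(beta / b)
    highest = alpha // a        # floor(alpha / a)
    return lowest <= highest
```

With $4 \le 3z$ and $3z \le 5$ the real shadow holds ($12 \le 15$) but the dark shadow does not ($3 < 4$), and indeed no integer $z$ lies between $4/3$ and $5/3$. Relaxing the upper bound to $3z \le 6$ satisfies the dark shadow, and $z = 2$ is a solution. 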


¹ The solution space defined by the set of constraints. 
The algorithm for checking for the existence of integer solutions to a set of 
constraints is summarized as follows: 

1. Choose the ‘best’ variable to eliminate. The ‘best’ variable either gives 
an exact projection, where the dark and real shadows are identical, 
or has coefficients as close to zero as possible. 

2. Calculate the real and dark shadows of the set of constraints. If the two shadows 
are equal, then there are integer solutions to the original set of constraints iff 
there are integer solutions to the shadow. 

3. Otherwise, if there are no integer solutions to the real shadow, there is no solution to 
the system of inequalities. If there are integer solutions to the dark shadow, the 
system has solutions. 

4. Otherwise, determine the largest coefficient $a$ of $z$ in any upper bound on $z$. 
For each lower bound $\beta \le bz$, test if there are integer solutions to the original 
problem combined with $bz = \beta + i$ for each $i$ such that $(ab - a - b)/a \ge i \ge 0$. 


Direction and distance vectors: For each common loop, a new variable is 
introduced. The value of this variable gives the distance, and its sign gives 
the direction, of the dependence with respect to that loop. The problem is 
projected onto these protected variables. A variable whose sign is already determined, 
or which is uncoupled², is unprotected, and the problem is projected onto the remaining 
variables. Otherwise, one protected variable is chosen and subproblems are generated for 
the two or three possible signs of that variable. This process is applied recursively to 
enumerate all the vectors. This is the only dependence test that does not use the 
framework discussed in Section 2.5 to generate all the possible direction 
vectors. 


² An uncoupled variable does not appear together with any other variable in any inequality. 
3.5.2 Implementation Details 

initialize_EQji_GEQ() forms the problem from the array reference subscript 
expressions and loop limits. initializeOmega() initializes the internal data 
structures used by the Omega test. simplifyProblem() accepts a pointer to the problem 
and returns 1 if a solution exists, otherwise 0. The problem is projected onto the 
protected variables and the resultant problem is returned. unprotectVariable() 
accepts a pointer to a problem and a variable, and unprotects the variable. constraintVariable() 
constrains the variable to have the sign +1, 0 or -1, unprotects it, reduces the problem 
and returns 1 if a solution exists. calculateDDVectors() calculates direction vectors 
and distance vectors by the method discussed above. 

3.6 Power Test 

3.6.1 Description 

The Power test is a combination of Euclid’s generalized algorithm and the 
Fourier-Motzkin elimination method. Euclid’s algorithm is used to determine whether the 
simultaneous linear diophantine equations, derived from the subscript expressions of 
the array references, have a solution without considering the loop bounds. The 
Fourier-Motzkin method is used to eliminate variables in a system of inequalities derived 
from the loop limits and direction vectors. The byproduct of Euclid’s algorithm 
is a system of one or more linear equations to which the simultaneous diophantine 
equations are reduced. 

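Consider the set of simultaneous diophantine equations in Equation 2.3. The 
coefficient matrix $A$ is 

$$A = \begin{pmatrix} 
a_{11} & a_{21} & \cdots & a_{m1} \\ 
a_{12} & a_{22} & \cdots & a_{m2} \\ 
\vdots & \vdots & & \vdots \\ 
a_{1n} & a_{2n} & \cdots & a_{mn} 
\end{pmatrix}$$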
Algorithm 5.5.1 in [3] transforms the coefficient matrix $A_{n \times m}$, formed from the 
diophantine equations, into a unimodular integer matrix $U_{n \times n}$ and an echelon integer 
matrix $D_{n \times m}$ such that $UA = D$. Elementary row transformations are applied to $A$ to 
transform it into the echelon matrix $D$. The same sequence of operations is performed on the 
unit matrix $I_{n \times n}$, yielding $U$, where $n$ is the number of variables in the system of 
equations. The echelon matrix $D$ and the unimodular integer matrix $U$ are of the form 

$$D = \begin{pmatrix} 
d_{11} & d_{21} & \cdots & d_{m1} \\ 
0 & d_{22} & \cdots & d_{m2} \\ 
\vdots & \vdots & & \vdots \\ 
0 & 0 & \cdots & d_{mm} \\ 
\vdots & \vdots & & \vdots 
\end{pmatrix}, \qquad 
U = \begin{pmatrix} 
u_{11} & u_{21} & \cdots & u_{n1} \\ 
u_{12} & u_{22} & \cdots & u_{n2} \\ 
\vdots & \vdots & & \vdots \\ 
u_{1n} & u_{2n} & \cdots & u_{nn} 
\end{pmatrix}$$

If there is an integer solution vector $t$ such that $tD = C$, then $h = tU$ satisfies $hA = C$, 
where $C$ is the constant vector of the simultaneous equations and $h$ is the vector of loop 
indices. After finding $D$ and $U$, the test finds values for $t_1$ through $t_m$ by solving 
$tD = C$ using a simple back substitution algorithm. If there are feasible solutions, the 
extended GCD algorithm stops here and assumes dependence. It also gives formulas that can be 
used to specify the index variables $h_1, h_2, \ldots, h_n$ in terms of the ‘free’ variables 
$t_{m+1}, t_{m+2}, \ldots, t_n$, derived from the matrix product $h = tU$. If the dependence system 
has constant dependence distances, we can find them by subtracting corresponding 
equations. That is, the dependence distance for a loop at depth $k$ $(1 \le k \le c)$ is 
found by subtracting the equations for $i_k$ and $j_k$. Suppose that $i_k = h_{2k-1}$ and 
$j_k = h_{2k}$; the two equations are subtracted by looking at 

$$j_k - i_k = t\,(U_{\cdot,2k} - U_{\cdot,2k-1})$$

(where $U_{\cdot,j}$ is a column of the matrix $U$). If the dependence distance is fixed, this 
will have nonzero coefficients only for $t_1$ through $t_m$, which were previously solved. 
If there are nonzero coefficients for any other $t_v$, where $v > m$, the dependence 
distance is not constant. 
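The reduction $UA = D$ and the solution of $tD = C$ can be sketched as follows. This is an illustrative version built on integer row operations, not a transcription of Algorithm 5.5.1 in [3] or of the FRAMES routines; the function names and the list-of-lists matrix layout (one row per index variable, one column per equation) are assumptions. 

```python
def reduce_to_echelon(A):
    """Return (U, D) with U unimodular, D echelon and U*A = D.

    A is an n x m integer matrix given as a list of n rows.
    Only integer row subtractions and row swaps are used.
    """
    n, m = len(A), len(A[0])
    D = [row[:] for row in A]
    U = [[int(i == j) for j in range(n)] for i in range(n)]
    r = 0                                    # next pivot row
    for c in range(m):
        while True:
            nz = [i for i in range(r, n) if D[i][c] != 0]
            if len(nz) <= 1:
                break
            p = min(nz, key=lambda i: abs(D[i][c]))
            for i in nz:                     # Euclid-style reduction
                if i != p:
                    q = D[i][c] // D[p][c]
                    D[i] = [x - q * y for x, y in zip(D[i], D[p])]
                    U[i] = [x - q * y for x, y in zip(U[i], U[p])]
        if nz:                               # move the pivot into row r
            p = nz[0]
            D[r], D[p] = D[p], D[r]
            U[r], U[p] = U[p], U[r]
            r += 1
    return U, D


def solve_t(D, C):
    """Solve t*D = C for the determined part of t, or return None."""
    n, m = len(D), len(D[0])
    t = []
    for c in range(m):
        r = len(t)
        partial = sum(t[i] * D[i][c] for i in range(r))
        if r < n and D[r][c] != 0:           # column c has a pivot in row r
            num = C[c] - partial
            if num % D[r][c] != 0:
                return None                  # no integer solution
            t.append(num // D[r][c])
        elif partial != C[c]:
            return None                      # inconsistent equation
    return t
```

For the single equation $i - j = -1$ (so $A$ has rows (1) and (-1)), the sketch gives $t_1 = -1$ and $U$ with rows (1, 0) and (1, 1). Then $h = tU$ yields $i = t_2 - 1$ and $j = t_2$, a constant distance of 1. For $2i - 2j = 1$ no integer $t_1$ exists, so there is no dependence. 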

So far we have exploited all the capabilities of Euclid’s algorithm. In the above 
phase we have not considered all the constraints on the dependence system, such as 
the loop bounds. The loop limits and the direction vector information form a set of linear 
inequalities, or constraints, on the set of ‘free’ variables. 

The Power test constructs a list of upper and lower bounds on each ‘free’ variable 
$t_k$; each lower and upper bound is a linear combination of $t_{m+1}, t_{m+2}, \ldots, 
t_{k-1}$. These give the boundaries of the solution space of the dependence equation; if 
the solution space is nonempty, then the dependence equation has solutions that 
satisfy all the conditions. Each lower bound for $t_k$ will be of the form 

$$lb_k t_k \ge lb_0 + lb_{m+1} t_{m+1} + \cdots + lb_{k-1} t_{k-1}, \quad lb_k > 0 \qquad (3.2)$$

These bounds are derived from the constraints on the index variables, such as the 
loop limits. Each index variable $h_i$ is defined by the extended GCD algorithm 
as some linear combination of the ‘free’ variables. In addition, the upper and lower 
limits of each index are themselves linear combinations of outer loop index variables. 
Thus the constraint $h_i \ge l_i$ can be algebraically reduced to an inequality constraint 
on one of the ‘free’ variables, of the form of Equation 3.2. 

After formulating the inequalities based upon the loop limits, direction vectors 
are generated according to the framework discussed in Chapter 2. For each direction 
vector element we get another inequality, which yields a lower or upper bound on 
one of the free variables. 

Given a set of inequalities in terms of the ‘free’ variables, Fourier-Motzkin elimination 
visits each ‘free’ variable (from $t_n$ down to $t_{m+1}$), comparing each lower bound to each 
upper bound. Each comparison will be of the form 

$$(lb_0 + lb_{m+1} t_{m+1} + \cdots + lb_{k-1} t_{k-1}) / lb_k \le t_k \le (ub_0 + ub_{m+1} t_{m+1} + \cdots + ub_{k-1} t_{k-1}) / ub_k$$

from which we derive 

$$(lb_k ub_0 - ub_k lb_0) + (lb_k ub_{m+1} - ub_k lb_{m+1})\, t_{m+1} + \cdots + (lb_k ub_{k-1} - ub_k lb_{k-1})\, t_{k-1} \ge 0$$

If any of the coefficients is nonzero, this yields another lower or upper limit 
on a lower-numbered ‘free’ variable. If all the coefficients are zero, then we 
have the simple inequality $lb_k ub_0 - ub_k lb_0 \ge 0$. If this inequality is not satisfied, then 
there is no solution to the dependence system, and thus no dependence. 
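The comparison of one lower bound with one upper bound can be sketched as follows. The list layout (position 0 holds the constant term, the remaining positions hold the coefficients of the lower-numbered free variables) and the function names are assumptions made for this illustration. 

```python
def compare_bounds(lb_k, lb, ub_k, ub):
    """Combine  lb_k*t_k >= lb[0] + sum(lb[i]*t_i)  with
                ub_k*t_k <= ub[0] + sum(ub[i]*t_i)   (lb_k, ub_k > 0).

    Returns the coefficients of the derived inequality
    (lb_k*ub[0] - ub_k*lb[0]) + sum((lb_k*ub[i] - ub_k*lb[i]) * t_i) >= 0.
    """
    return [lb_k * u - ub_k * l for l, u in zip(lb, ub)]


def classify(derived):
    """Interpret a derived inequality."""
    constant, rest = derived[0], derived[1:]
    if any(rest):
        return "new bound"          # constrains a lower-numbered variable
    return "consistent" if constant >= 0 else "no solution"
```

For instance, $2t \ge 3$ and $3t \le 4$ combine to $-1 \ge 0$, so `classify(compare_bounds(2, [3], 3, [4]))` reports "no solution". The bounds $t_2 \ge 1 + t_1$ and $t_2 \le 10$ combine to $9 - t_1 \ge 0$, which is a new upper bound on $t_1$. 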

3.6.2 Implementation Details 

dd_siia_T() is the routine that invokes the Power test. It returns distance and direction 
vectors. dd_reduce() applies the generalized GCD algorithm to reduce the matrix 
A to echelon form, at the same time determining the unimodular matrix. dd_solve() 
solves for $t_1$ through $t_m$ using the back substitution method. dd_fixed() computes 
the distance vector, if any distance vector exists. dd_T_init() initializes the bounds for each 
‘free’ variable. dd_enforce_limit() finds all the possible vectors using the 
Fourier-Motzkin elimination method. The variable dd is cleared if no solution is detected in 
any of the above steps. 



Chapter 4 
Restructurer 


4.1 Introduction 

The goal of automatic parallelization is to transform sequential code to parallel code. 
The restructuring of the given program must not affect the semantics of the program. 
The restructured loop-nest can be executed on different processors in parallel. If 
there is a dependence relation in the loop-nest that prevents parallelization, then 
a restructuring compiler can attempt several simple transformations on the loop-nest 
to remove the dependence relation. For example, when the dependence graph of 
the loop is acyclic, statement reordering will always allow parallelization. A 
topological sort of the dependence graph specifies how to reorder the statements. If 
a cycle in the graph cannot be reduced to a single statement, then loop distribution 
can be applied to isolate the cycle in a separate loop. 

The loop restructuring techniques use dependence information to determine their 
feasibility and profitability. The feasibility study of each transformation determines 
whether the transformation is applicable to a given loop-nest. The profitability 
study, in turn, determines whether the transformation extracts any 
parallelism from the code. Once a restructuring technique is found to be feasible and 
profitable, appropriate modifications are performed on the program representation 
of the loop-nest. Hence the implementation of a restructuring technique contains 
three parts. 
1. Feasibility study 

2. Determining profitability 

3. Changes to the underlying program representation 

Every transformation has its own feasibility and profitability criteria. At the back end 
of the restructurer, the program representation of the loop-nest is converted back to a 
control flow graph [1]. In the following sections we discuss the restructuring 
techniques that are incorporated into FRAMES. 


4.2 Program Dependence Graph 

The control flow graph (CFG) has been the usual representation for the control flow 
relationships of a program. But this representation does not allow the restructurer to 
determine the control conditions of an operation readily. A statement Y is said to 
be control dependent on X according to the following definition. 

Definition 4.1 Let G be a control flow graph, and let X and Y be nodes in G. Y is control 
dependent on X if and only if 

1. There exists a directed path P from X to Y with every Z in P (excluding X 
and Y) post-dominated by Y, and 

2. X is not post-dominated by Y. 

If Y is control dependent on X then X must have two exits; following one of the 
exits from X always results in Y being executed; while taking the other exit may 
result in Y not being executed. Condition 1 can be satisfied by a path consisting of 
a single edge. Condition 2 is always satisfied when X and Y are the same node. 
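Definition 4.1 can be evaluated on a small CFG as sketched below. The sketch uses a commonly used equivalent formulation (Y is control dependent on X when Y post-dominates some successor of X, or is that successor, without strictly post-dominating X). The function names, the dictionary representation of the CFG, and the requirement that every node other than the exit has at least one successor are assumptions of this example. 

```python
def post_dominators(succ, exit_node):
    """pdom[n] = set of nodes that post-dominate n (including n itself)."""
    nodes = set(succ)
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:                       # iterate to a fixed point
        changed = False
        for n in nodes - {exit_node}:
            new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


def control_dependences(succ, exit_node):
    """Return the set of pairs (X, Y) with Y control dependent on X."""
    pdom = post_dominators(succ, exit_node)
    result = set()
    for x, successors in succ.items():
        for s in successors:
            for y in pdom[s]:            # y post-dominates the successor s
                if y == x or y not in pdom[x]:
                    result.add((x, y))
    return result
```

For an IF statement P with branches T and F that join at J, the call `control_dependences({"P": ["T", "F"], "T": ["J"], "F": ["J"], "J": ["EXIT"], "EXIT": []}, "EXIT")` yields only (P, T) and (P, F): the join node J post-dominates P and is therefore not control dependent on it. 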

Many of the transformations need to change the control dependences of statements 
in a given program. For example, new control statements are added to the CFG 
in cycle shrinking, the order of control statements is changed in loop interchange, 
and so on. To support these types of operations, an efficient representation of the 
given program is needed. The essential properties of such a representation are to 

• make the control dependences of every operation in a given program explicit¹ 

• make the operations on the representation transparent 

The Program Dependence Graph(PDG) is the program representation used in 
FRAMES. Since PDG connects computationally relevant parts of the program, 
many code improving transformations require less time to perform than with other 
program representations. A single walk of these dependences is sufficient to perform 
many optimizations. For details of the PDG, the reader is referred to [4]. 

4.2.1 Description of PDG 

PDG consists of three types of nodes. 

Predicate nodes correspond to the conditional statements in a given program. 

Region nodes summarize the set of control conditions for a node and also group 
all nodes having the same set of control conditions together. 

Statement nodes correspond to the imperative statements in a given program. 

Predicate nodes and region nodes are connected by conditional edges labeled 
True or False. Region nodes and statement nodes are connected by 
unconditional edges. Each predicate has a unique successor for each truth value. 
A strongly connected component (SCC) in the program dependence graph contains 
the predicate nodes that determine an exit from the loop. The other nodes 
in the PDG that are not in the SCC lie on some path of control dependence edges from a 
node in the SCC. Intuitively, these correspond to the body of the loop. Nested loops 
appear as distinct SCCs with a control dependence edge between the outer loop and 
each immediate inner loop. Loops at the same level appear as SCCs with a common 
ancestor region. Consider the following loop-nest. 

¹ It is not possible to determine exactly under what control conditions an operation in a given 
program is performed. 
Figure 4.1: The Program dependence graph 


N = 50 
DO I = 1, N 
    DO J = 1, N 
        S1: A(I, J) = ... 
        S2: B(I, J) = A(I-2, J) 
    ENDDO 
ENDDO 


The PDG for the above loop-nest is given in Figure 4.1. 
4.3 Parallel code generation 

Once the data dependence graph (DDG) is constructed, we proceed to generate parallel 
code for the given loop-nest as follows. 

1. Find all the strongly connected components(SCC) in the data dependence 
graph. 

2. Reduce the DDG to an acyclic graph by treating each SCC as a single node. 

3. Apply node-splitting, cycle shrinking and loop interchange on each SCC. 

4. Generate code for each SCC in an order consistent with the dependences. That 
is, by using a method similar to topological sort, first generate code for nodes that 
depend on no others, then for nodes that depend only on nodes for which 
code has already been generated, etc. 

The code generation procedure can generate loop-nests for the SCCs in a given DDG. 
The loop-nests may contain parallel loops, depending on the dependence relations 
between the nodes in that particular SCC. However, the compiler need not give up on an 
SCC that does not allow parallel loops. A restructuring compiler can attempt several 
transformations on the SCC. These transformations may be intended to remove 
cycles or to extract any inherent parallelism in the given SCC. For example, node 
splitting is used to remove dependence cycles and cycle shrinking is used to extract 
inherent parallelism from a given SCC. Hence the algorithm given in [2] is modified 
to incorporate the third step. The algorithm to generate a loop-nest for a given SCC 
is reproduced from [2]. The loop-nest generated may be a mix of parallel loops and 
serial loops. 





Algorithm: Loop-nest Generation for SCCs 

Input: Acyclic data dependence graph. 

Output: CFG of the resulting loop-nest. 

parallelcode() 
{ 
    init_sccq(sccs, noofsccs, sccq, inedgeno); 
    while ((scc = pop_q(sccq)) != NULL) 
    { 
        for (depth = 1 to maxdepth of the loop-nest) { 
            let D_depth be the dependence graph consisting of all 
                dependence edges in DDG which are at level depth or greater 
                and which are internal to scc; 
            if (dependence cycle exists in D_depth) 
                generate a level-depth DO statement; 
            else 
                generate a level-depth DOALL statement; 
        } 
        sort statements in scc in topological order; 
        for (each statement in scc) 
            insert_stmt(); 
        process_sccq(sccs, noofsccs, i, sccq, inedgeno); 
    } 
} 
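The choice between DO and DOALL at each depth can be sketched as follows. This is an illustrative Python version, not the FRAMES code; the function names and the representation of a dependence edge as a (source, sink, level) triple are assumptions. 

```python
def has_cycle(nodes, edges):
    """Depth-first search for a cycle in the directed graph (nodes, edges)."""
    succ = {n: [] for n in nodes}
    for src, dst in edges:
        succ[src].append(dst)
    state = {n: 0 for n in nodes}        # 0 = new, 1 = on stack, 2 = done

    def visit(n):
        state[n] = 1
        for m in succ[n]:
            if state[m] == 1 or (state[m] == 0 and visit(m)):
                return True
        state[n] = 2
        return False

    return any(state[n] == 0 and visit(n) for n in nodes)


def loop_kinds(scc, dep_edges, max_depth):
    """Return 'DO' or 'DOALL' for each loop level 1..max_depth of an SCC.

    dep_edges holds (source, sink, level) triples; only edges internal to
    the SCC and carried at the current level or deeper are considered.
    """
    kinds = []
    for depth in range(1, max_depth + 1):
        internal = [(s, d) for s, d, level in dep_edges
                    if level >= depth and s in scc and d in scc]
        kinds.append("DO" if has_cycle(scc, internal) else "DOALL")
    return kinds
```

For a single statement with a dependence on itself carried by the outer loop, `loop_kinds({"S1"}, [("S1", "S1", 1)], 2)` gives a serial outer loop and a parallel inner loop, that is, `["DO", "DOALL"]`. 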


The routines init_sccq() and process_sccq() ensure that the parallel 
code for the SCCs is generated in topological order. init_sccq() calculates the number 
of incoming edges of each SCC from other SCCs. All SCCs with inedge number 
zero are enqueued into sccq. These SCCs are not dependent on any other SCCs. 
Hence code can be generated for these SCCs without data dependence violations. 
process_sccq() recalculates the inedge number of each SCC, without taking into 
consideration the edges from the SCC that is passed as an argument. It also enqueues SCCs 
with inedge number zero. insert_stmt() is the routine which takes a statement 
and the control conditions under which the statement is executed and inserts the 
statement in the CFG. This is the basic routine to construct the CFG from the PDG. The 
algorithm to convert PDG from CFG given in [4] is not suitable for our purpose 
[15]. The algorithm for insert_stmt() and the proof of the algorithm are given in 
[15]. 
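The interplay of the queue and the inedge numbers can be sketched as follows. The function name `emit_order` and the representation of inter-SCC dependences as (source, sink) pairs of SCC numbers are assumptions of this example; the comments indicate which part plays the role of init_sccq() and process_sccq(). 

```python
from collections import deque


def emit_order(num_sccs, scc_edges):
    """Order in which code is generated for SCCs numbered 0..num_sccs-1.

    scc_edges is a set of (source, sink) dependence edges between
    distinct SCCs of the acyclic reduced graph.
    """
    inedge = [0] * num_sccs
    for _, dst in scc_edges:
        inedge[dst] += 1
    # Role of init_sccq(): enqueue every SCC without incoming edges.
    queue = deque(i for i in range(num_sccs) if inedge[i] == 0)
    order = []
    while queue:
        scc = queue.popleft()
        order.append(scc)                # code for this SCC is generated here
        # Role of process_sccq(): discount the edges leaving this SCC.
        for src, dst in scc_edges:
            if src == scc:
                inedge[dst] -= 1
                if inedge[dst] == 0:
                    queue.append(dst)
    return order
```

With three SCCs and the edges 0→1, 0→2 and 1→2, the order is `[0, 1, 2]`: SCC 2 is enqueued only after both of its predecessors have been processed. 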


4.4 Node Splitting 

Loop parallelization is impossible when statements in the body of a given loop 
are involved in a dependence cycle [12]. Dependence cycles that involve only flow 
dependences are usually hard to break. There are cases, however, where dependence 
cycles can be broken, resulting in total or partial parallelization of the corresponding 
loops. Cycle breaking may be possible if the dependence cycles involve flow and 
anti/output dependences. Anti and output dependences are false dependences and 
can either be ignored or eliminated. The backward anti/output dependence edges are 
ignored. The forward anti/output dependence edges are removed by the node splitting 
transformation. The following example illustrates the node splitting transformation. 


Before node splitting: 

DO I = 1, N 
    S1: A(I) = B(I) 
    S2: B(I-1) = A(I+1) 
ENDDO 

After node splitting: 

DO I = 1, N 
    S2': TEMP(I) = A(I+1) 
ENDDO 
DO I = 1, N 
    S1: A(I) = B(I) 
    S2: B(I-1) = TEMP(I) 
ENDDO 

If the dependence cycle in the SCC is not broken as a result of node splitting, then 
the transformation is not profitable. 

Figure 4.2: DDG before and after Node Splitting 

Definition 4.2: A subgraph $G'(V, E')$ is called the restricted graph of a data 
dependence graph $G(V, E)$ if $E' \subseteq E$ and every flow dependence edge in $E$ is also present 
in $E'$. 

If a dependence cycle is present in $G'$, where $G'$ is the restricted graph of a given 
SCC, it is not possible to break the dependence cycle by means of node splitting. 
Hence the transformation is not profitable. 
Algorithm: Node Splitting 

Input: Data dependence graph of a given loop-nest. 

Output: Data dependence graph 

callnodesplit() 
{ 
    find SCCs in a given DDG. 
    for (each SCC in the DDG) { 
        if (SCC contains more than one statement) then { 
            construct restricted graph G' of SCC. 
            if (cycle is not present in G') then 
                remove each anti/output dependence 
                forward edge by node splitting 
        } 
    } 
} 

The node splitting is not affected by the presence of scalar dependences. The 
transformation does not change the control dependences of any of the statements. 
The new assignment statements generated during node-splitting are also control 
dependent on the same conditions as that of the node that is split. 


4.5 Loop Interchange 

The loop interchange [17, 11] is the process of switching loops in a given loop-nest. 
This is one of the most powerful restructuring techniques. This technique is mainly 
architecture dependent. Some of the architecture dependent optimizations [17] are 

• multi-processor machines perform better with parallel outer loops 

• most of the parallel machines give better performance with larger loop limits 

• vector machines work better with larger vectors than with smaller ones 

• multi processors perform better when many parallel operations are possible 
Loop interchange can be used to order the loops in the best possible way 
for the underlying architecture of the machine, thereby achieving good 
performance. We consider the interchange of simple loops in this work. Interchange of 
triangular loops as well as some advanced loop interchanges are discussed in [17]. 

4.5.1 Feasibility of Loop Interchange 

Not all loop interchanges yield correct results. The requirements for two simple 
loops $L'$ and $L''$ to be interchanged are as follows. 

1. Loop $L''$ is nested perfectly within $L'$. 

2. The loop limits of $L''$ are invariant with respect to the index of $L'$. 

3. There are no statements $S_v$ and $S_w$ in $L''$ with a dependence relation $S_v\, \delta_{(<,>)}\, S_w$. 

The proof of the last requirement is given below. 

Theorem 4.1 Suppose 5'„6(<,»5,„. Then there are values ii, %i, and ji, j^, where 
*i < ill *2 > h ond S'ufiiiia] ^ •S'w[;i)i 2 ]- If the loops are interchanged, then 
•S^wDUii] be executed before 5u[*i,i2] since j 2 < h- That is sink will be executed 
before source, thus violating the data dependence relation. Suppose there are no 
dependence relations with (<,>) direction vectors. The only possible dependence 
direction vectors between statements Su and Sy, in the loop are (<,<), (<,=), 
(=,<), (=,=), Now, suppose there is a dependence relation Sub{<,<)Sut- Then 
there are values ii, ij and ji, j 2 where ij < ji, i 2 < j 2 and 5„[ii,i2]^‘S'u;[ii>i2]- 
If the loops are interchanged, then 5„[ii,i2] will still be executed before Su,\ji,j 2 [, thus 
satisfying the data dependence relations. A similar argument holds for the direction 
vectors (<, =), (=, <), (=, =). 
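
The dependence part of the feasibility condition can be checked mechanically on direction vectors. The following is a minimal sketch in Python, not FRAMES code: the function names and the tuple encoding of direction vectors are assumptions made for illustration. It tests whether swapping two adjacent loops keeps every direction vector lexicographically non-negative, and replays a small loop carrying a (<,>) dependence in both loop orders.

```python
from typing import Iterable, Tuple

Direction = Tuple[str, ...]  # one of '<', '=', '>' per loop, outermost first


def lex_nonnegative(v: Direction) -> bool:
    """True if the first non-'=' entry is '<' (source runs before sink)."""
    for d in v:
        if d == '<':
            return True
        if d == '>':
            return False
    return True  # all '=': loop-independent dependence


def interchange_is_legal(vectors: Iterable[Direction], outer: int = 0) -> bool:
    """Check whether loops `outer` and `outer + 1` may be swapped."""
    for v in vectors:
        w = list(v)
        w[outer], w[outer + 1] = w[outer + 1], w[outer]
        if not lex_nonnegative(tuple(w)):
            return False
    return True


def run(order: str, n: int = 4) -> dict:
    """Execute a(i,j) = a(i-1,j+1) + 1, which carries a (<,>) dependence."""
    a = {(i, j): 10 * i + j for i in range(n + 2) for j in range(n + 2)}
    points = [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)]
    if order == 'ji':  # interchanged nest: j becomes the outer loop
        points.sort(key=lambda p: (p[1], p[0]))
    for i, j in points:
        a[(i, j)] = a[(i - 1, j + 1)] + 1
    return a
```

With this encoding, `interchange_is_legal([('<', '>')])` is false, in line with Theorem 4.1, and `run('ij')` differs from `run('ji')` because the interchanged order reads an element before it has been written.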

4.5.2 Profitability of Loop Interchange 

The profitability of loop interchange differs from one machine architecture to 
another. This makes loop interchange the highest-level transformation that a 
compiler can perform. Compiling for multiprocessors requires parallel loops, and 
better performance is achieved if the parallel loops are the outer loops in a given 
loop-nest. Loop interchange can be used to switch a serial outer loop with a 
parallel inner loop. 

    DO I = 1, N 
      DOALL J = 1, M 
        S1: A(I, J) = B(I, J)*C(I, J) + A(I+1, J) 
      ENDDO 
    ENDDO 


In the above loop-nest, the scheduler [12] assigns a processor to every iteration 
of the parallel loop. This assignment is performed N times, once for each iteration 
of the serial outer loop, so the iterations of the parallel loop must be 
synchronized N times. The number of fork and join operations needed to synchronize 
the parallel loop iterations is prohibitively high, and may result in parallelized 
code that is slower than the sequential code. This type of profitability can be 
captured by the depth of dependence. The interchange of two loops is said to be 
profitable if it increases the depth of dependence between any two statements in 
the given loop-nest. The algorithm for loop interchange is similar to the bubble 
sort algorithm and is given below. 

Algorithm: Loop Interchange 

Input: SCC and the loops in a given loop-nest. 

Output: The best order of loops in terms of profitability and feasibility. 

loopinterchange() 
{ 
    find perfectly nested loops in the given loop-nest; 
    let l_1, l_2, ..., l_n be the perfectly nested loops; 
    for (L_1 = l_{n-1} to l_1) 
        for (L_2 = l_n to L_1 + 1) 
            if (switching of L_1 and L_2 is feasible and profitable) { 
                interchange the loops L_1 and L_2; 
                interchange the corresponding elements of 
                the direction vectors in DDG; 
                modify the PDG; 
            } 
} 
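
The bubble-sort flavour of the algorithm can be sketched as follows. This is an illustrative Python variant that only swaps adjacent loops and repeats until nothing changes; it is not the FRAMES routine, and the helper names (`depth`, `feasible`, `profitable`, `interchange`) as well as the exact profitability rule (no dependence depth shrinks and at least one grows) are assumptions chosen to match the description above.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[str, ...]  # direction vector, outermost loop first


def depth(v: Vector) -> int:
    """Level (1 = outermost) of the loop carrying the dependence.

    A loop-independent dependence (all '=') gets depth len(v) + 1.
    """
    for level, d in enumerate(v, start=1):
        if d == '<':
            return level
    return len(v) + 1


def swapped(v: Vector, p: int) -> Vector:
    """Direction vector after interchanging loops p and p + 1 (0-based)."""
    w = list(v)
    w[p], w[p + 1] = w[p + 1], w[p]
    return tuple(w)


def feasible(vectors: Sequence[Vector], p: int) -> bool:
    """No swapped vector may have '>' as its first non-'=' entry."""
    for v in vectors:
        first = next((d for d in swapped(v, p) if d != '='), '=')
        if first == '>':
            return False
    return True


def profitable(vectors: Sequence[Vector], p: int) -> bool:
    """No dependence depth shrinks and at least one grows."""
    before = [depth(v) for v in vectors]
    after = [depth(swapped(v, p)) for v in vectors]
    return all(a >= b for a, b in zip(after, before)) and after != before


def interchange(loops: Sequence[str], vectors: Sequence[Vector]):
    """Bubble-sort style passes over adjacent loops until nothing changes."""
    loops, vectors = list(loops), [tuple(v) for v in vectors]
    changed = True
    while changed:
        changed = False
        for p in range(len(loops) - 1):
            if feasible(vectors, p) and profitable(vectors, p):
                loops[p], loops[p + 1] = loops[p + 1], loops[p]
                vectors = [swapped(v, p) for v in vectors]
                changed = True
    return loops, vectors
```

For the loop-nest of Section 4.5.2, the anti dependence on A has direction vector (<,=), so `interchange(['I', 'J'], [('<', '=')])` moves J outward and leaves the dependence carried by the inner loop. Every swap strictly increases the sum of the dependence depths, so the passes terminate.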

Applying loop interchange to the PDG given in Figure 4.1 transforms it to the 
representation given in Figure 4.3. The loop L3 is the duplicate of L2 and 
the loop L4 is the duplicate of L1. 

4.6 Cycle Shrinking 

In many cases it is not possible to eliminate a cycle in the dependence graph. 
Usually, this occurs when the cycle is formed by flow dependence edges. Cycle 
shrinking can be applied to exploit the parallelism inherent in a serial loop or 
in a nest of serial loops. As mentioned in Chapter 2, the distance vector explicitly 
specifies the number of iterations between two successive references to the same 
memory location. Those intervening iterations can be executed in parallel. Cycle 
shrinking cannot be applied when the dependence distance is one. There are 
basically two cycle shrinking techniques, chosen according to the complexity of 
the loops. Their performance depends on the number of iterations that can be 
executed in parallel; this number is referred to as the reduction factor $\lambda$. 

Figure 4.3: PDG after loop interchange 

4.6.1 Calculation of Reduction factor 

Singly nested loops: Suppose there are n statements $S_1, S_2, \ldots, S_n$ in 
a loop-nest that are involved in a dependence cycle. Moreover, let the dependence 
distance between $S_1$ and $S_2$ be $k_1$, between $S_2$ and $S_3$ be $k_2$, and so 
on until $k_n$. We consider the following cases. 

• All the dependence distances are the same constant. Then the reduction 
factor $\lambda$ is any one of the distances $k_1, k_2, \ldots, k_n$ (for further 
details and proofs, the reader may refer to [12]). 

• All the dependence distances are constant, but the distances of different 
dependences differ. Subscript expressions of the form $aI+b$ and $aI+c$ give 
rise to such dependence distances. The reduction factor $\lambda$ is given by 

$$\lambda = \min\{k_1, k_2, \ldots, k_n\}$$ 

• When the distances vary with different instances of a dependence, $\lambda$ is 
calculated by the following formula: 

$$\lambda = \min_{1 \le i \le n} \{\phi(\delta_i)\}$$ 

where $\phi()$ generates the different instances of dependence $\delta_i$. In singly 
nested loops this happens when we have array subscripts of the form $aI+b$ where 
$a > 1$ or $a < -1$. 

It is evident that the type of reduction factor depends upon the subscript expressions 
in the array references. 

Theorem 4.2 Consider a DO loop with k statements which are all involved in a 
dependence cycle. If the reduction factor of the cycle is $\lambda$, then cycle 
shrinking increases the speed-up of the loop by a factor of $\lambda \cdot k$. 
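
For the constant-distance cases, the effect of cycle shrinking on a singly nested loop can be illustrated with a small simulation. The sketch below is written in Python under stated assumptions: the two-statement loop body, the array contents and all function names are hypothetical and chosen only for illustration. The dependence cycle has distances 2 and 3, so the reduction factor is 2; each block of two iterations is executed in reverse order to mimic a DOALL, whose iterations may run in any order.

```python
def reduction_factor(distances):
    """Reduction factor for constant distances: the smallest one in the cycle."""
    lam = min(distances)
    if lam < 1:
        raise ValueError("cycle shrinking needs positive dependence distances")
    return lam


def make_arrays(n):
    """Hypothetical initial data, including the elements read at i <= 0."""
    a = {i: float(i) for i in range(-3, n + 1)}
    b = {i: float(2 * i) for i in range(-3, n + 1)}
    return a, b


def body(a, b, i):
    a[i] = b[i - 3] + 1.0  # S1: flow dependence from S2 with distance 3
    b[i] = a[i - 2] * 2.0  # S2: flow dependence from S1 with distance 2


def run_sequential(n):
    a, b = make_arrays(n)
    for i in range(1, n + 1):
        body(a, b, i)
    return a, b


def run_shrunk(n, lam):
    a, b = make_arrays(n)
    for k in range(1, n + 1, lam):          # serial outer loop with step lam
        block = range(k, min(k + lam, n + 1))
        for i in reversed(block):           # DOALL: any order must be valid
            body(a, b, i)
    return a, b
```

Calling `run_shrunk(10, reduction_factor([2, 3]))` produces the same arrays as `run_sequential(10)`. With a distance of one the blocks would contain a single iteration each, which is why no parallelism is gained in that case.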
Multiply nested loops: There are two versions of cycle shrinking that can be used 
for multiply nested loops. They differ in the way the reduction factor is computed. 

• True dependence (TD) shrinking: The true distance of each dependence can be 
calculated from its distance vector. Let $(d_1, d_2, \ldots, d_s)$ be the distance 
vector of a dependence $\delta$. The true distance $t$ is 

$$t = \sum_{i=1}^{s} d_i \prod_{j=i+1}^{s} (U_j - L_j + 1)$$ 

where $s$ is the number of common loops enclosing the references and $L_j$, $U_j$ 
are the limits of the j-th loop. Let $(t_1, t_2, \ldots, t_n)$ be the true distances 
of the n dependences involved in a cycle; the reduction factor is 

$$\lambda = \min(t_1, t_2, \ldots, t_n)$$ 

In TD shrinking the multi-dimensional array space is treated as a linear space. The 
loop limits must be known in order to compute the true distances. The values 
of the distances per se are not needed, but determining the minimum true 
distance in a cycle is essential for TD shrinking. The following example 
demonstrates this transformation. 
    DO I = L1, U1 
      DO J = L2, U2 
        A(I, J) = B(I-3, J-5) 
        B(I, J) = B(I-2, J-4) 
      ENDDO 
    ENDDO 


    DO K = 1, (U1-L1+1)(U2-L2+1), λ 
      T1 = (K DIV (U2-L2+1)) + L1 
      T2 = ((K+λ) DIV (U2-L2+1)) + L1 
      T3 = (K % (U2-L2+1)) + L2 - 1 
      T4 = ((K+λ) % (U2-L2+1)) + L2 - 2 
      DOALL J = T3, U2 
        A(T1, J) = B(T1-3, J-5) 
        B(T1, J) = B(T1-2, J-4) 
      ENDDO 
      DOALL I = T1+1, T2-1 
        DOALL J = L2, U2 
          A(I, J) = B(I-3, J-5) 
          B(I, J) = B(I-2, J-4) 
        ENDDO 
      ENDDO 
      DOALL J = L2, T4 
        A(T2, J) = B(T2-3, J-5) 
        B(T2, J) = B(T2-2, J-4) 
      ENDDO 
    ENDDO 


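The T values for a loop nest having n normalized loops can be calculated by the 
following formulas. For $1 \le i < n$: 

$$T_{l_i} = \left( \left( k \;\mathrm{div} \prod_{j=i+1}^{n} U_j \right) \bmod U_i \right) + 1$$ 

$$T_{u_i} = \left( \left( (k+\lambda) \;\mathrm{div} \prod_{j=i+1}^{n} U_j \right) \bmod U_i \right) + 1$$ 

and for $i = n$: 

$$T_{l_n} = k \bmod U_n$$ 

$$T_{u_n} = ((k+\lambda) \bmod U_n) - 1$$ 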
Figure 4.4: The PDG before and after application of TD shrinking 

• Selective shrinking: The dependence cycle within k nested loops can be viewed 
as k different dependence cycles, one for each individual loop. Each dependence 
in a cycle is labeled with the corresponding element of its distance vector. 
Selective shrinking computes the reduction factor $\lambda_i$ $(i = 1, 2, \ldots, k)$ 
for each loop in the nest, starting with the outermost loop. The process stops 
when, for some j $(1 \le j \le k)$, $\lambda_j > 1$. Then the j-th loop in the loop 
nest is blocked by a factor of $\lambda_j$. In addition, all the loops nested inside 
the j-th loop are transformed to DOALLs. The following loop-nest illustrates this 
method. 


Loop nest: 

    DO I = L1, U1 
      DO J = L2, U2 
        A(I, J) = B(..., J-4) 
        B(I, J) = A(I-2, J-5) 
      ENDDO 
    ENDDO 

Transformed loop: 

    DO I = L1, U1 
      DO K = L2, U2, 3 
        DOALL J = K, K+2 
          A(I, J) = B(..., J-4) 
          B(I, J) = A(I-2, J-5) 
        ENDDO 
      ENDDO 
    ENDDO 


Algorithm: Cycle Shrinking 

Input: SCC and loops enclosing SCC 
Output: Transformed code 

cycleshrinking(SCC) 
{ 
    find the feasibility of cycle shrinking. 
    determine type of shrinking and reduction factor. 
    if (type of shrinking is selective shrinking) 
    { 
        block the corresponding loop by reduction factor 
        convert all the inner loops to parallel loops 
        make changes to PDG 
    } 
    else 
    { 
        /* True dependence shrinking */ 
        produce outer most loop with step, reduction factor 
        produce first and second unrolled loops 
        make replicate of strongly connected components 
        replace loop variables by T variables and new loop variable 
        make changes to PDG 
    } 
} 
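
The two reduction factors for multiply nested loops can be computed directly from the distance vectors. The Python sketch below is illustrative only: the function names are hypothetical, `extents` stands for the iteration counts $U_j - L_j + 1$ of the enclosing loops (which must be known, as noted above), and the concrete extents used in the usage note are assumed values.

```python
from math import prod
from typing import Optional, Sequence, Tuple


def true_distance(d: Sequence[int], extents: Sequence[int]) -> int:
    """Linearized distance: sum of d_i times the extents of all inner loops."""
    return sum(di * prod(extents[i + 1:]) for i, di in enumerate(d))


def td_reduction_factor(vectors: Sequence[Sequence[int]],
                        extents: Sequence[int]) -> int:
    """TD shrinking: the minimum true distance over the cycle."""
    return min(true_distance(d, extents) for d in vectors)


def selective_reduction(
        vectors: Sequence[Sequence[int]]) -> Optional[Tuple[int, int]]:
    """Selective shrinking: first loop (1 = outermost) whose factor exceeds 1.

    Returns (loop_number, reduction_factor), or None if no loop qualifies.
    """
    for level in range(len(vectors[0])):
        lam = min(d[level] for d in vectors)
        if lam > 1:
            return level + 1, lam
    return None
```

For the distance vectors (3,5) and (2,4) of the TD shrinking example, and assuming 10 outer and 20 inner iterations, the true distances are 65 and 44, so TD shrinking yields a reduction factor of 44, whereas selective shrinking stops at the outer loop with a factor of 2.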

4.7 Loop Fusion 

Traditionally, loop fusion is used to increase the size of the loop body, thereby 
reducing the loop overhead. On parallel processors this is one of the main goals, 
since scheduling a loop onto various processors is very expensive [12]. 

In loop fusion, two loop bodies are combined to make a single loop. The two 
loops must be at the same depth, called the depth of fusion. Moreover, the loops 
must satisfy the following conditions: 

1. Both loops must be of the same type; i.e., both are either doall loops or do 
loops. 

2. The loop limits and steps of both loops must be the same. 

3. Both loops must be executed under similar control conditions. 

4. No exit out of either loop is permitted. 

5. There must not be any cross-iteration dependence at the depth of fusion. 

If two loops satisfy the above-mentioned criteria, then they can be fused. Consider 
the following loop-nest. 


Original loops: 

    DO I = 1, N 
      S1: A(I) = B(I) + C(I) 
    ENDDO 
    DO I = 1, N 
      S2: D(I) = A(I)*2 
    ENDDO 

Fused loop: 

    DO I = 1, N 
      S1: A(I) = B(I) + C(I) 
      S2: D(I) = A(I)*2 
    ENDDO 

This loop-nest satisfies the above-mentioned conditions and hence can be fused 
into a single loop. The effect of loop fusion on the PDG is shown in Figure 4.5. 

Algorithm: Loop Fusion 

Input: set of SCCs 
Output: set of SCCs 

loopFusion(sccs, noofsccs, depth) 
{ 
    init_sccq(sccs, noofsccs, sccq, inedgeno); 
    while ((workingelement = pop_q(sccq)) != NULL) { 
        enqueue(tempq2, workingelement); 
        while ((testingelement = pop_q(sccq)) != NULL) { 
            if (workingelement and testingelement are fusable) then { 
                make modification to the PDG 
                fuse testingelement with workingelement. 
            } 
            else { 
                enqueue(tempq1, testingelement); 
                process_scc(sccs, noofsccs, testingelement, sccq); 
            } 
        } 

        /* Reenqueue all the elements that were not fused to sccq */ 
        while ((element = pop_q(tempq1)) != NULL) 
            enqueue(sccq, element); 

        remove data dependence edges with depth of dependence 
        more than 'depth'; 

        initialize ownsccs with the elements in workingelement. 
        loopFusion(ownsccs, noofownsccs, depth + 1); 
        while ((element = pop_q(tempq2)) != NULL) 
            enqueue(sccq, element); 
    } 
} 


The GCD and Banerjee tests are applied to determine cross-over dependence at 
the depth of fusion. If such a dependence exists, the two SCCs are not fusable. 
tempq1 is used to store SCCs that are not fused to the workingelement. tempq2 is 
used to store the SCCs that are fused together with the workingelement. Once loop 
fusion is applied at a given depth of fusion, the routine is called recursively at 
depth+1. 
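
The fusion conditions can be sketched as a simple check over adjacent loops. The Python fragment below is an illustration under stated assumptions, not the queue-based FRAMES routine: the `Loop` record, `fusable` and `fuse_adjacent` are hypothetical names, the dependence test is supplied by the caller as a callback instead of running the GCD and Banerjee tests, and condition 3 (similar control conditions) is not modelled.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Loop:
    kind: str            # 'do' or 'doall'
    lower: str
    upper: str
    step: str
    body: List[str]
    has_exit: bool = False


def fusable(l1: Loop, l2: Loop, cross_dependence: bool) -> bool:
    """Conditions 1, 2, 4 and 5 of the list above."""
    same_type = l1.kind == l2.kind
    same_range = (l1.lower, l1.upper, l1.step) == (l2.lower, l2.upper, l2.step)
    no_exit = not (l1.has_exit or l2.has_exit)
    return same_type and same_range and no_exit and not cross_dependence


def fuse_adjacent(loops: List[Loop],
                  depends: Callable[[Loop, Loop], bool]) -> List[Loop]:
    """Greedily fuse each loop into its predecessor when the test allows it."""
    fused: List[Loop] = []
    for loop in loops:
        if fused and fusable(fused[-1], loop, depends(fused[-1], loop)):
            prev = fused[-1]
            fused[-1] = Loop(prev.kind, prev.lower, prev.upper, prev.step,
                             prev.body + loop.body)
        else:
            fused.append(loop)
    return fused
```

Applied to the two loops of the example above with a callback that reports no fusion-preventing dependence, `fuse_adjacent` returns a single loop whose body holds S1 followed by S2.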



Chapter 5 
Epilogue 


5.1 Testing 

So far we have discussed different aspects of data dependence analysis, data 
dependence tests and various restructuring techniques, along with their 
implementation. We now discuss the testing that FRAMES has undergone. 

5.1.1 Test Programs 

FRAMES has been tested on various benchmark programs extracted from standard 
packages such as LINPACK and the Livermore Kernels. In addition, many test programs 
were written to test various aspects of the implementation. These programs mainly 
contain various types of loop nests, with nesting depths from one to four. 
Care has been taken to include all types of array references, such as coupled 
subscripts, subscripts with a single loop index and subscripts with multiple loop 
indices. The number of dimensions of the array references varies from one to 
three. Some of the references contain symbolic variables in their subscripts. 
Some bias is still present in these hand-written programs. While testing the 
restructuring phase we never considered loops with a scalar variable defined and 
used within the loop nest, and IF statements within the loop-nests are rare. 

5.1.2 Output of FRAMES 

The output of data dependence analysis is a data dependence graph. The graph is 
printed in the following format: each vertex, followed by the list of edges 
originating from that vertex. The result of each dependence test for each pair of 
array references is written to the file frames.log, created in the current 
directory. This enables programmers to evaluate the performance of the various 
dependence tests. The output of FRAMES is the restructured code of the given 
program. The ensuing subsections demonstrate the capabilities of FRAMES. Each 
subsection is devoted to a specific ability of the restructuring compiler. 

5.1.3 An Example for DDA 

The following example shows the effect of introducing the more sophisticated 
tests. The traditional Banerjee's and GCD tests reported a dependence for the 
references to array a, which clearly shows their inability to handle coupled 
subscripts. The Lambda and the Omega tests reported independence, which enabled 
the restructurer to apply loop distribution. 

Input Program: 

c Capability of Lambda and Omega test 
      dimension a(100, 100, 100), b(100, 100) 
      do i = 1, 10 
        do j = 1, 20 
          a(2*i+3*j+10, 3*i+j+9, i+j) = b(i, j) 
          b(j, i) = a(i-j+11, 2*i-j+7, i+j) 
        enddo 
      enddo 
      stop 
      end 

Output of FRAMES: 

*** Restructured program :: *** 
c Capability of Lambda and Omega test 
      dim a(100,100,100), b(100,100) 
      forall i = 1, 10 
        forall j = 1, 20 
          a(2*i+3*j+10,3*i+j+9,i+j) = b(i,j) 
        endfor 
      endfor 
      forall i = 1, 10 
        forall j = 1, 20 
          b(j,i) = a(i-j+11,2*i-j+7,i+j) 
        endfor 
      endfor 
      stop 
      end 
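
The independence of the two references to array a can be confirmed by exhaustive enumeration over the loop bounds. The Python sketch below is only a brute-force cross-check with hypothetical function names; it does not implement the Lambda or Omega test, which reach the same conclusion symbolically.

```python
def written_elements():
    """Elements of a written by the first statement, for all (i, j)."""
    return {(2 * i + 3 * j + 10, 3 * i + j + 9, i + j)
            for i in range(1, 11) for j in range(1, 21)}


def read_elements():
    """Elements of a read by the second statement, for all (i, j)."""
    return {(i - j + 11, 2 * i - j + 7, i + j)
            for i in range(1, 11) for j in range(1, 21)}


def references_overlap():
    """True if some element of a is both written and read in the loop-nest."""
    return not written_elements().isdisjoint(read_elements())
```

Here `references_overlap()` returns `False`: equating the three subscript pairs forces $i + 8j = 7$ for the writing iteration, which has no solution with $i \ge 1$ and $j \ge 1$.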

5.1.4 An Example of Loop interchange 

The following example illustrates loop distribution as well as loop interchange. 

Input Program: 

c Loop interchange and Loop distribution 
      dimension a(100, 100), b(100, 100), c(100, 100) 
      do j = 1, 100 
        do i = 1, 100 
          a(i, j) = b(i, j-1) 
          b(i, j) = a(i, j-1) 
          c(i, j) = a(i, j-1) 
        enddo 
      enddo 
      stop 
      end 

Output of FRAMES: 

*** Restructured program :: *** 
c Loop interchange and Loop distribution 
      dim a(100,100), b(100,100), c(100,100) 
      forall i = 1, 100 
        do j = 1, 100 
          b(i,j) = a(i,j-1) 
          a(i,j) = b(i,j-1) 
        enddo 
      endfor 
      forall j = 1, 100 
        forall i = 1, 100 
          c(i,j) = a(i,j-2) 
        endfor 
      endfor 
      stop 
      end 

The first two statements in the input program form a cycle in the DDG. 


Hence 
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Output of FRAMES 

*** Restructured program :: *** 
c Loop Fusion 
      dim a(100, 100), b(100, 100, 100) 
      dim e(100, 100), c(100, 100), d(100, 100, 100) 
      forall i = 1, 100 
        forall j = 1, 100 
          a(i, j) = 5.0 
          c(i, j) = e(i, j) 
          forall k = 1, 100 
            b(i, j, k) = 10.0 
            d(i, j, k) = 15.0 
          endfor 
        endfor 
      endfor 
      stop 
      end 

There is no data dependence between the statements in the loop-nest. Hence the 
DDG contains only vertices without edges. Adjacent loops are fused into a single 
loop since there are no cross-over dependences. 

5.1.6 A Nightmare 

The following example demonstrates the inability of FRAMES to extract the full 
amount of parallelism from the code. This is basically due to the ordering of 
transformations. 
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Input Program: 

c A nightmare of FRAMES 
      dimension a(100,100), b(100,100), c(100,100) 
      read(*,*) n 
      do i = 1, n 
        do j = 1, n 
          a(i, j) = b(i-4, j) 
          b(i, j) = a(i-5, j) + c(i, j) 
        enddo 
      enddo 
      stop 
      end 

There exists a dependence cycle between the two statements. The output of 
FRAMES is given below. 

Output of FRAMES: 

c A nightmare of FRAMES 
      dim a(100,100), b(100,100), c(100,100) 
      read(*,*) n 
      do j = 1, n 
        do i_nvar_1 = 1, n, 4 
          forall i = i_nvar_1, i_nvar_1+3 
            a(i,j) = b(i-4,j) 
            b(i,j) = a(i-5,j) + c(i,j) 
          endfor 
        enddo 
      enddo 
      stop 
      end 

In FRAMES, loop interchange is applied before cycle shrinking. In the above 
example, loop interchange is justified because the depth of dependence is 
increased as a result of the interchange. Cycle shrinking is then able to block 
only the inner loop. If cycle shrinking were applied before loop interchange, the 
outer loop would be blocked, so that the maximum number of iterations could be 
executed in parallel. 
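
The cost of the fixed ordering can be made concrete with a small simulation. The Python sketch below is illustrative and rests on assumptions: the initial array contents and function names are hypothetical, the "shrink first" variant is assumed to block the i loop by 4 and run all iterations inside a block as one parallel region, and parallel regions are simulated by executing their iterations in reverse order.

```python
def make_arrays(n):
    """Hypothetical initial data, including rows read at i <= 0."""
    idx = [(i, j) for i in range(-4, n + 1) for j in range(1, n + 1)]
    a = {p: float(p[0] + p[1]) for p in idx}
    b = {p: float(p[0] * p[1]) for p in idx}
    c = {p: 1.0 for p in idx}
    return a, b, c


def body(a, b, c, i, j):
    a[(i, j)] = b[(i - 4, j)]
    b[(i, j)] = a[(i - 5, j)] + c[(i, j)]


def sequential(n):
    a, b, c = make_arrays(n)
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            body(a, b, c, i, j)
    return a, b


def interchange_then_shrink(n, lam=4):
    """Order used by FRAMES: j outermost, i blocked by lam inside it."""
    a, b, c = make_arrays(n)
    regions = 0
    for j in range(1, n + 1):
        for k in range(1, n + 1, lam):
            regions += 1                      # one fork/join per block
            for i in reversed(range(k, min(k + lam, n + 1))):
                body(a, b, c, i, j)
    return a, b, regions


def shrink_then_parallelize(n, lam=4):
    """Alternative order: i blocked outermost, whole block runs in parallel."""
    a, b, c = make_arrays(n)
    regions = 0
    for k in range(1, n + 1, lam):
        regions += 1
        block = [(i, j) for i in range(k, min(k + lam, n + 1))
                 for j in range(1, n + 1)]
        for i, j in reversed(block):
            body(a, b, c, i, j)
    return a, b, regions
```

For n = 8 both variants reproduce the sequential result, but the FRAMES ordering needs 16 parallel regions of 4 iterations each, while blocking the outer loop first needs 2 regions of 32 iterations each.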


5.2 Conclusions 

We have enhanced the data dependence analysis capabilities of FRAMES by 
providing a framework which takes results from dependence tests and builds a data 
dependence graph. The I, the Lambda, the Omega and the Power tests are 
implemented for this purpose. These new tests handle all types of subscript 
expressions [16] that appear in scientific programs. The restructuring capabilities 
of FRAMES are enhanced by incorporating loop interchange, loop fusion and cycle 
shrinking. 


5.3 Future Developments 

Many more data dependence tests can be incorporated. For example, the GA test 
[16] enumerates the solutions of Diophantine equations. This is of great help in 
extracting fine-grain parallelism from the code. The potential of some of the 
dependence tests has not been fully exploited. For example, the Omega test gives a 
predicate expression in terms of symbolic variables whose values are not known at 
compile time. This predicate can be coded into the program. If the predicate 
evaluates to false at run time, then there is no dependence and the loop-nest can 
be executed in parallel. 

Data dependence across loop-nests is not determined. Consider loop nests 
LN1, LN2 and LN3 in lexical order in a given program. Assume there exists a 
dependence from LN1 to LN2, and that LN3 is not dependent on any other loop-nest. 
In this case LN1 and LN3 can be executed in parallel if dependence across 
loop-nests is determined. This would enhance the speed-up. 

Loop interchange is implemented only for perfectly nested loops with constant 
loop bounds. This can be extended to imperfectly nested loops as well as to loop 
bounds defining a trapezoidal region in the iteration space. The goal of the 
present implementation is to move parallel inner loops outward. There are many 
other goals that loop interchange can serve, for example making the array 
references in the loop-nest unit stride. Many more restructuring techniques 
can be implemented. The main limitations of the restructuring phase of FRAMES 
can be summarized as follows. 

• The inability to handle scalar dependences within the loop-nest. 

• Restructuring techniques are applied in a fixed order: node splitting, loop 
interchange, cycle shrinking and loop fusion. This order cannot be altered even 
if the program demands a different ordering to produce efficient code. 

A better program representation which unifies control dependences and data 
dependences is still a research topic. It would enable us to handle scalar 
dependences and control statements within the loop-nest more efficiently. An 
algorithm which dynamically determines the best ordering of the various 
transformations for a given program is yet to be developed. 
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